The Chrome sidebar now defends against prompt injection attacks. Three layers: XML-framed prompts with trust boundaries, a command allowlist that restricts bash to browse commands only, and Opus as the default model (harder to manipulate).
--model, --allowedTools, and other flags set by the server. Every server-side configuration change was silently dropped. Now uses the queued args.<user-message> tags with explicit instructions to treat content as data, not instructions. XML special characters (< > &) are escaped to prevent tag injection attacks.$B goto, $B click, $B snapshot, etc.). All other bash commands (curl, rm, cat, etc.) are forbidden. This prevents prompt injection from escalating to arbitrary code execution.docs/designs/ML_PROMPT_INJECTION_KILLER.md covering the follow-up ML classifier (DeBERTa, BrowseSafe-bench, Bun-native 5ms vision). P0 TODO for the next PR.Six fixes from community PRs and bug reports. The big one: your dependency tree is now pinned. Every bun install resolves the exact same versions, every time. No more floating ranges pulling fresh packages from npm on every setup.
bun.lock is committed and tracked. Every install resolves identical versions instead of floating ^ ranges from npm. Closes the supply-chain vector from #566.gstack-slug no longer crashes outside git repos. Falls back to directory name and "unknown" branch when there's no remote or HEAD. Every review skill that depends on slug detection now works in non-git contexts../setup no longer hangs in CI. The skill-prefix prompt now auto-selects short names after 10 seconds. Conductor workspaces, Docker builds, and unattended installs proceed without human input.'wx' string flag instead of numeric fs.constants that Bun compiled binaries don't handle on Windows./ship and /review find your design docs. Plan search now checks ~/.gstack/projects/ first, where /office-hours writes design documents. Previously, plan validation silently skipped because it was looking in the wrong directories./autoplan dual-voice actually works. Background subagents can't read files (Claude Code limitation), so the Claude voice was silently failing on every run. Now runs sequentially in foreground. Both voices complete before the consensus table.AI models now recommend instead of override. When Claude and Codex agree on a scope change, they present it to you instead of just doing it. Your direction is the default, not the models' consensus.
The browse server runs on localhost and requires a token for access, so these issues only matter if a malicious process is already running on your machine (e.g., a compromised npm postinstall script). This release hardens the attack surface so that even in that scenario, the damage is contained.
/health endpoint. Token now distributed via .auth.json file (0o600 permissions) instead of an unauthenticated HTTP response./refs and /activity/*. Removed wildcard origin header so websites can't read browse activity cross-origin.textContent instead of innerHTML. Prevents DOM injection if server-provided data ever contained markup. Standard defense-in-depth for browser extensions.validateReadPath now calls realpathSync and handles macOS /tmp symlink correctly./project-evil could match a freeze boundary set to /project.gstack-config rejects regex-special keys and escapes sed patterns. gstack-telemetry-log sanitizes branch/repo names in JSON output.gstack can generate real UI mockups. Not ASCII art, not text descriptions of hex codes, real visual designs you can look at, compare, pick from, and iterate on. Run /office-hours on a UI idea and you'll get 3 visual concepts in Chrome with a comparison board where you pick your favorite, rate the others, and tell the agent what to change.
$D). New compiled CLI wrapping OpenAI's GPT Image API. 13 commands: generate, variants, iterate, check, compare, extract, diff, verify, evolve, prompt, serve, gallery, setup. Generates pixel-perfect UI mockups from structured design briefs in ~40 seconds.$D compare generates a self-contained HTML page with all variants, star ratings, per-variant feedback, regeneration controls, a remix grid (mix layout from A with colors from B), and a Submit button. Feedback flows back to the agent via HTTP POST, not DOM polling./design-shotgun skill. Standalone design exploration you can run anytime. Generates multiple AI design variants, opens a comparison board in your browser, and iterates until you approve a direction. Session awareness (remembers prior explorations), taste memory (biases new generations toward your demonstrated preferences), screenshot-to-variants (screenshot what you don't like, get improvements), configurable variant count (3-8).$D serve command. HTTP server for the comparison board feedback loop. Serves the board on localhost, opens in your default browser, collects feedback via POST. Stateful: stays alive across regeneration rounds, supports same-tab reload via /api/progress polling.$D gallery command. Generates an HTML timeline of all design explorations for a project: every variant, feedback, organized by date.$D extract analyzes an approved mockup with GPT-4o vision and writes colors, typography, spacing, and layout patterns to DESIGN.md. Future mockups on the same project inherit the established visual language.$D diff compares two images and identifies differences by area with severity. $D verify compares a live site screenshot against an approved mockup, pass/fail gate.$D evolve takes a screenshot of your live site and generates a mockup showing how it should look based on your feedback. Starts from reality, not blank canvas.$D variants --viewports desktop,tablet,mobile generates mockups at multiple viewport sizes.$D prompt extracts implementation instructions from an approved mockup: exact hex colors, font sizes, spacing values, component structure. Zero interpretation gap.{{DESIGN_SHOTGUN_LOOP}} for the comparison board. Can generate "what 10/10 looks like" mockups when a design dimension rates below 7/10.{{DESIGN_SHOTGUN_LOOP}} for Phase 5 AI mockup review.design/src/ (16 files, ~2500 lines TypeScript)serve.ts (stateful HTTP server), gallery.ts (timeline generation)design/test/serve.test.ts (11 tests), design/test/gallery.test.ts (7 tests)docs/designs/DESIGN_TOOLS_V1.md{{DESIGN_SETUP}} (binary discovery), {{DESIGN_SHOTGUN_LOOP}} (shared comparison board loop for /design-shotgun, /plan-design-review, /design-consultation)Fixes 20 Socket alerts and 3 Snyk findings from the skills.sh security audit. Your skills are now cleaner, your telemetry is transparent, and 2,000 lines of dead code are gone.
$TEST_EMAIL / $TEST_PASSWORD env vars instead of test@example.com / password123. Cookie import section now has a safety note.gstack-telemetry-log binary only runs if telemetry is enabled AND the binary exists. Local JSONL logging always works, no binary needed.BUN_VERSION=1.3.10 and skip the download if bun is already installed.scripts/resolvers/*.ts. The RESOLVERS map is now the single source of truth with no shadow copies.test:audit script runs 6 regression tests that enforce all audit fixes stay in place.You can now choose how gstack skills appear: short names (/qa, /ship, /review) or namespaced (/gstack-qa, /gstack-ship). Setup asks on first run, remembers your preference, and switching is one command.
/qa, /ship) or namespaced (/gstack-qa, /gstack-ship). Short names are recommended. Your choice is saved to ~/.gstack/config.yaml and remembered across upgrades.--prefix flag. Complement to --no-prefix. Both flags persist your choice so you only decide once./ship suggesting /qa), it uses the right name for your install.gstack-config works on Linux. Replaced BSD-only sed -i '' with portable mktemp+mv. Config writes now work on GNU/Linux and WSL.~/.gstack/ was created earlier in setup. Fixed with a .welcome-seen sentinel file.Codex was wandering into ~/.claude/skills/ and following gstack's own instructions instead of reviewing your code. Now every codex prompt includes a boundary instruction that keeps it focused on the repository. Covers all 11 callsites across /codex, /autoplan, /review, /ship, /plan-eng-review, /plan-ceo-review, and /office-hours.
codex exec and codex review calls now prepend a filesystem boundary instruction telling Codex to ignore skill definition files. Prevents Codex from reading SKILL.md preamble scripts and wasting 8+ minutes on session tracking and upgrade checks.gstack-config, gstack-update-check, SKILL.md, skills/gstack), the /codex skill now warns and suggests a retry.Six community PRs landed in one batch. Install is faster, skills no longer collide with other tools, and you can cleanly uninstall gstack when needed.
bin/gstack-uninstall cleanly removes gstack from your system: stops browse daemons, removes all skill installs (Claude/Codex/Kiro), cleans up state. Supports --force (skip confirmation) and --keep-state (preserve config). (#323)subprocess.run(shell=True)), SSRF via LLM-generated URLs, stored prompt injection, async/sync mixing, and column name safety checks now fire automatically on Python projects. (#531)--single-branch --depth 1. Full history available for contributors. (#484)gstack- prefix. Skill symlinks are now gstack-review, gstack-ship, etc. instead of bare review, ship. Prevents collisions with other skill packs. Old symlinks are auto-cleaned on upgrade. Use --no-prefix to opt out. (#503)findPort() now uses net.createServer() instead of Bun.serve() for port probing, fixing an EADDRINUSE race on Windows where the polyfill's stop() is fire-and-forget. (#490)Skill scripts now work correctly in zsh. Previously, bash code blocks in skill templates used raw glob patterns like .github/workflows/*.yaml and ls ~/.gstack/projects/$SLUG/*-design-*.md that would throw "no matches found" errors in zsh when no files matched. Fixed 38 instances across 13 templates and 2 resolvers using two approaches: find-based alternatives for complex patterns, and setopt +o nomatch guards for simple ls commands.
.github/workflows/ globs replaced with find. cat .github/workflows/*deploy*, for f in .github/workflows/*.yml, and ls .github/workflows/*.yaml patterns in /land-and-deploy, /setup-deploy, /cso, and the deploy bootstrap resolver now use find ... -name instead of raw globs.~/.gstack/ and ~/.claude/ globs guarded with setopt. Design doc lookups, eval result listings, test plan discovery, and retro history checks across 10 skills now prepend setopt +o nomatch 2>/dev/null || true (no-op in bash, disables NOMATCH in zsh).ls jest.config.* vitest.config.* in the testing resolver now has a setopt guard.When you run gstack in Conductor with multiple workspaces open, Codex could silently review the wrong project. The codex exec -C flag resolved the repo root inline via $(git rev-parse --show-toplevel), which evaluates in whatever cwd the background shell inherits. In multi-workspace environments, that cwd might be a different project entirely.
codex exec commands across /codex, /autoplan, and 4 resolver functions now resolve _REPO_ROOT at the top of each bash block and reference the stored value in -C. No more inline evaluation that races with other workspaces.codex review also gets cwd protection. codex review doesn't support -C, so it now gets cd "$_REPO_ROOT" before invocation. Same class of bug, different command.|| pwd fallback silently used whatever random cwd was available. Now it errors out with a clear message if not in a git repo.scripts/resolvers/ months ago but never deleted. They had already diverged from the live versions and contained the old vulnerable pattern..tmpl, resolver .ts, and generated SKILL.md files for codex commands using inline $(git rev-parse --show-toplevel). Prevents reintroduction.Seven community contributions merged, reviewed, and tested. Plus security hardening for telemetry and review logging, and E2E test stability fixes.
.git, .vscode, etc.) are no longer picked up as skill templates./ship and /document-release now use the correct co-author line for Codex vs Claude../ no longer treated as CSS selectors. $B screenshot ./path/to/file.png now works instead of trying to find a CSS element.gen:skill-docs failure no longer blocks binary compilation.browse-basic, ship-base-branch, and review-dashboard-via tests now pass reliably by extracting only relevant SKILL.md sections instead of copying full 1900-line files into test fixtures.journey-think-bigger routing test. Never passed reliably because the routing signal was too ambiguous. 10 other journey tests cover routing with clear signals.The Chrome sidebar agent used to navigate to the wrong page when you asked it to do something. If you'd manually browsed to a site, the sidebar would ignore that and go to whatever Playwright last saw (often Hacker News from the demo). Now it works.
chrome.tabs.query() and sends it to the server. Previously the sidebar agent used Playwright's stale page.url(), which didn't update when you navigated manually in headed mode./connect-chrome now kills leftover sidebar-agent processes before starting a new one. Old agents had stale auth tokens and would silently fail, causing the sidebar to freeze./connect-chrome. Kills stale browse servers and cleans Chromium profile locks before connecting. Prevents "already connected" false positives after crashes./plan-eng-review automatically analyzes your plan for parallel execution opportunities. When your plan has independent workstreams, the review outputs a dependency table, parallel lanes, and execution order so you know exactly which tasks to split into separate git worktrees.
/plan-eng-review required outputs. Extracts a structured table of plan steps with module-level dependencies, computes parallel lanes, and flags merge conflict risks. Skips automatically for single-module or single-track plans.Three bugs in /codex caused 30+ minute hangs with zero output during plan reviews and adversarial checks. All three are fixed.
~/.claude/plans/. It would waste 10+ tool calls searching before giving up. Now the plan content is embedded directly in the prompt, and referenced source files are listed so Codex reads them immediately.PYTHONUNBUFFERED=1, python3 -u, and flush=True on every print call across all three Codex modes.xhigh (23x more tokens, known 50+ min hangs per OpenAI issues #8545, #8402, #6931) with per-mode defaults: high for review and challenge, medium for consult. Users can override with --xhigh flag when they want maximum reasoning.--xhigh override works in all modes. The override reminder was missing from challenge and consult mode instructions. Found by adversarial review.When you ship a branch with 12 commits spanning performance work, dead code removal, and test infra, the PR should mention all three. It wasn't. The CHANGELOG and PR summary biased toward whatever happened most recently, silently dropping earlier work.
Every gstack skill now has a voice. Not a personality, not a persona, but a consistent set of instructions that make Claude sound like someone who shipped code today and cares whether the thing works for real users. Direct, concrete, sharp. Names the file, the function, the command. Connects technical work to what the user actually experiences.
Two tiers: lightweight skills get a trimmed version (tone + writing rules). Full skills get the complete directive with context-dependent tone (YC partner energy for strategy, senior eng for code review, blog-post clarity for debugging), concreteness standards, humor calibration, and user-outcome guidance.
preamble.ts, injected via the template resolver. Tier 1 skills get a 4-line version. Tier 2+ skills get the full directive./plan-ceo-review, senior eng for /review, best-technical-blog-post for /investigate.The first time you run /land-and-deploy on a project, it does a dry run. It detects your deploy infrastructure, tests that every command works, and shows you exactly what will happen... before it touches anything. You confirm, and from then on it just works.
If your deploy config changes later (new platform, different workflow, updated URLs), it automatically re-runs the dry run. Trust is earned, maintained, and re-validated when the ground shifts.
Every click, fill, and select now waits for the page to settle before returning. No more stale snapshots because an XHR was still in-flight. Chain accepts pipe-delimited format for faster multi-step flows. You can save and restore browser sessions (cookies + open tabs). And iframe content is now reachable.
Network idle detection. click, fill, and select auto-wait up to 2s for network requests to settle before returning. Catches XHR/fetch triggered by interactions. Uses Playwright's built-in waitForLoadState('networkidle'), not a custom tracker.
$B state save/load. Save your browser session (cookies + open tabs) to a named file, load it back later. Files stored at .gstack/browse-states/{name}.json with 0o600 permissions. V1 saves cookies + URLs only (not localStorage, which breaks on load-before-navigate). Load replaces the current session, not merge.
$B frame command. Switch command context into an iframe: $B frame iframe, $B frame --name checkout, $B frame --url stripe, or $B frame @e5. All subsequent commands (click, fill, snapshot, etc.) operate inside the iframe. $B frame main returns to the main page. Snapshot shows [Context: iframe src="..."] header. Detached frames auto-recover.
Chain pipe format. Chain now accepts $B chain 'goto url | click @e5 | snapshot -ic' as a fallback when JSON parsing fails. Pipe-delimited with quote-aware tokenization.
getActiveFrameOrPage() checks isDetached() and auto-recovers.upload uses the frame-aware target for file input locators.You can now watch Claude work in a real Chrome window and direct it from a sidebar chat.
Headed mode with sidebar agent. $B connect launches a visible Chrome window with the gstack extension. The Side Panel shows a live activity feed of every command AND a chat interface where you type natural language instructions. A child Claude instance executes your requests in the browser ... navigate pages, click buttons, fill forms, extract data. Each task gets up to 5 minutes.
Personal automation. The sidebar agent handles repetitive browser tasks beyond dev workflows. Browse your kid's school parent portal and add parent contact info to Google Contacts. Fill out vendor onboarding forms. Extract data from dashboards. Log in once in the headed browser or import cookies from your real Chrome with /setup-browser-cookies.
Chrome extension. Toolbar badge (green=connected, gray=not), Side Panel with activity feed + chat + refs tab, @ref overlays on the page, and a connection pill showing which window gstack controls. Auto-loads when you run $B connect.
/connect-chrome skill. Guided setup: launches Chrome, verifies the extension, demos the activity feed, and introduces the sidebar chat.
Sidebar agent ungated. Previously required --chat flag. Now always available in headed mode. The sidebar agent has the same security model as Claude Code itself (Bash, Read, Glob, Grep on localhost).
Agent timeout raised to 5 minutes. Multi-page tasks (navigating directories, filling forms across pages) need more than the previous 2-minute limit.
/autoplan reviews now count toward the ship readiness gate. When /autoplan ran full CEO + Design + Eng reviews, /ship still showed "0 runs" for Eng Review because autoplan-logged entries weren't being read correctly. Now the dashboard shows source attribution (e.g., "CLEAR (PLAN via /autoplan)") so you can see exactly which tool satisfied each review./ship no longer tells you to "run /review first." Ship runs its own pre-landing review in Step 3.5 — asking you to run the same review separately was redundant. The gate is removed; ship just does it./land-and-deploy now checks all 8 review types. Previously missed review, adversarial-review, and codex-plan-review — if you only ran /review (not /plan-eng-review), land-and-deploy wouldn't see it./plan-ceo-review or /plan-eng-review. Now correctly maps to codex-plan-review entries./codex review now tracks staleness. Added the commit field to codex review log entries so the dashboard can detect when a codex review is outdated./autoplan no longer hardcodes "clean" status. Review log entries from autoplan used to always record status:"clean" even when issues were found. Now uses proper placeholder tokens that Claude substitutes with real values./retro and /ship. You can now run /ship on GitLab repos — it creates merge requests via glab mr create instead of gh pr create. /retro detects default branches on both platforms. All 11 skills using BASE_BRANCH_DETECT automatically get GitHub, GitLab, and git-native fallback detection.github.com or gitlab, gstack checks gh auth status / glab auth status to detect authenticated platforms — no manual config needed./document-release works on GitLab. After /ship creates a merge request, the auto-invoked /document-release reads and updates the MR body via glab instead of failing silently./land-and-deploy. Instead of silently failing on GitLab repos, /land-and-deploy now stops early with a clear message that GitLab merge support is not yet implemented.codex exec calls now explicitly set -C to the git root.stdio as an array (['ignore', 'ignore', 'ignore']), not a string ('ignore'). Fixes #448, #454, #458./ship and /review now actually enforce the quality gates they've been talking about. Coverage audit becomes a real gate (not just a diagram), plan completion gets verified against the diff, and verification steps from your plan run automatically.
## Test Coverage in CLAUDE.md.gstack-config set proactive true/false.SHA-256(hostname+username) approach meant anyone who knew your machine identity could compute your installation ID. Now uses a random UUID stored in ~/.gstack/installation-id — not derivable from any public input, rotatable by deleting the file.verify-rls.sh now correctly treats INSERT success as expected (kept for old client compat), handles 409 conflicts and 204 no-ops.gate (blocks PRs) or periodic (weekly cron + on-demand). Gate tests cover functional correctness and safety guardrails. Periodic tests cover expensive Opus quality benchmarks, non-deterministic routing tests, and tests requiring external services (Codex, Gemini). CI feedback is faster and cheaper while quality benchmarks still run weekly.gen-skill-docs.ts triggered all 56 E2E tests. Now only the ~27 tests that actually depend on it run. Same for llm-judge.ts, test-server.ts, worktree.ts, and the Codex/Gemini session runners. The truly global list is down to 3 files (session-runner, eval-store, touchfiles.ts itself).test:gate and test:periodic scripts replace test:e2e:fast. Use EVALS_TIER=gate or EVALS_TIER=periodic to filter tests by tier.GSTACK_SUPABASE_URL instead of GSTACK_TELEMETRY_ENDPOINT. Edge functions need the base URL, not the REST API path. The old variable is removed from config.sh.inserted count before advancing — if zero events were inserted, the cursor holds and retries next run.E2E_TIERS map in test/helpers/touchfiles.ts classifies every test — a free validation test ensures it stays in sync with E2E_TOUCHFILESEVALS_FAST / FAST_EXCLUDED_TESTS removed in favor of EVALS_TIERallow_failure removed from CI matrix (gate tests should be reliable).github/workflows/evals-periodic.yml runs periodic tests Monday 6 AM UTCsupabase/migrations/002_tighten_rls.sqlsupabase/verify-rls.sh (9 checks: 5 reads + 4 writes)test/telemetry.test.ts with field name verificationbrowse/dist/ binaries from git (arm64-only, rebuilt by ./setup)/plan-eng-review review report is now tested end-to-end — if it stops writing ## GSTACK REVIEW REPORT to the plan file, the test catches it./office-hours, /plan-ceo-review, /plan-design-review, and /plan-eng-review all check for Codex availability, prompt the user, and handle the fallback when Codex is unavailable.test/skill-e2e-plan.test.ts: plan-review-report, codex-offered-eng-review, codex-offered-ceo-review, codex-offered-office-hours, codex-offered-design-reviewtouchfiles to the documented global touchfile list in CLAUDE.md/browse users: the server process died when the CLI exited (Bun's unref() doesn't truly detach on Windows), the health check never ran because process.kill(pid, 0) is broken in Bun binaries on Windows, and Chromium's sandbox failed when spawned through the Bun→Node process chain. All three are now fixed. Credits to @fqueiro (PR #191) for identifying the detached: true approach.ensureServer() now tries an HTTP health check before falling back to PID-based detection — more reliable on every OS, not just Windows.~/.gstack/browse-startup-error.log so Windows users (who lose stderr due to process detachment) can debug.isServerHealthy() and startup error logging in browse/test/config.test.tsgit apply ~/.gstack-dev/harvests/<id>/gemini.patch to grab improvements.describeWithWorktree() helper. Any E2E test can now opt into worktree isolation with a one-line wrapper. Future tests that need real repo context (git history, real diff) can use this instead of tmpdirs.~/.gstack/projects/$SLUG/evals/ instead of the global ~/.gstack-dev/evals/. Multi-project users no longer get eval results mixed together.lib/worktree.ts) is a reusable platform module — future skills like /batch can import it directly.GLOBAL_TOUCHFILES updated so worktree infrastructure changes trigger all E2E tests.Every /autoplan phase now gets two independent second opinions — one from Codex (OpenAI's frontier model) and one from a fresh Claude subagent. Three AI reviewers looking at your plan from different angles, each phase building on the last.
[codex-only], [subagent-only], [single-reviewer mode]).10 community PRs merged — bug fixes, platform support, and workflow improvements.
BROWSE_EXTENSIONS_DIR to load Chrome extensions (ad blockers, accessibility tools, custom headers) into your browse testing sessions.setup --local installs gstack into .claude/skills/ in your current project instead of globally. Useful for per-project version pinning./office-hours, /plan-eng-review, /ship, and /review now check whether new CLI tools or libraries have a build/publish pipeline. No more shipping artifacts nobody can download.skill-check and gen-skill-docs automatically discover skills from the filesystem..gstack/ directory didn't exist, causing every invocation to think another process held the lock. Fixed by creating the state directory before lock acquisition.no matches found in zsh when no pending files exist.--force now actually forces upgrades. gstack-upgrade --force clears the snooze file, so you can upgrade immediately after snoozing.run: scalars that broke YAML parsing. Added actionlint CI workflow.Thanks to @osc, @Explorer1092, @Qike-Li, @francoisaubert1, @itstimwhite, @yinanli1917-cloud for contributions in this wave.
testConcurrentIfSelected. Wall clock drops from ~18min to ~6min — limited by the slowest individual test, not sequential sum.Dockerfile.ci) with pre-installed toolchain. Rebuilds automatically when Dockerfile or package.json changes, cached by content hash in GHCR..claude/skills/ instead of nested under .claude/skills/gstack/ — project-level skill discovery doesn't recurse into subdirectories.EVALS_CONCURRENCY=40 in CI for maximum parallelism (local default stays at 15)workflow_dispatch trigger for manual re-runs.agents/ to prevent stale files, and a one-time migration auto-cleans oversized descriptions on existing installs.package.json version now stays in sync with VERSION. Was 6 minor versions behind. A new CI test catches future drift.stderr is captured and checked.test/gen-skill-docs.test.ts validates all .agents/ descriptions stay within 1024 charsgstack-update-check includes a one-time migration that deletes oversized Codex SKILL.md files.pending-* glob pattern that triggered zsh's "no matches found" error on every invocation (the common case where no pending telemetry files exist). Replaced shell glob with find to avoid zsh's NOMATCH behavior entirely. Thanks to @hnshah for the initial report and fix in PR #332. Fixes #313.find instead of bare shell globs for .pending-* pattern matching./review now satisfies the ship readiness gate. Previously, running /review before /ship always showed "NOT CLEARED" because /review didn't log its result and /ship only looked for /plan-eng-review. Now /review persists its outcome to the review log, and all dashboards recognize both /review (diff-scoped) and /plan-eng-review (plan-stage) as valid Eng Review sources./ship suggests "run /review or /plan-eng-review" instead of only mentioning /plan-eng-review.REVIEW_DASHBOARD resolver instead of creating a duplicate ship-only resolver./cso v2 — start where the breaches actually happen. The security audit now begins with your infrastructure attack surface (leaked secrets in git history, dependency CVEs, CI/CD pipeline misconfigurations, unverified webhooks, Dockerfile security) before touching application code. 15 phases covering secrets archaeology, supply chain, CI/CD, LLM/AI security, skill supply chain, OWASP Top 10, STRIDE, and active verification.--daily runs a zero-noise scan with an 8/10 confidence gate (only reports findings it's highly confident about). --comprehensive does a deep monthly scan with a 2/10 bar (surfaces everything worth investigating).--diff mode scopes the audit to changes on your branch vs the base branch — perfect for pre-merge security checks..env files, unsigned webhooks, unpinned GitHub Actions, rootless Dockerfiles). All verified passing.grep in Bash; v2 uses Claude Code's native Grep tool for reliable results without truncation./plan-ceo-review or /plan-eng-review, you can get a "brutally honest outside voice" from a different AI model (Codex CLI, or a fresh Claude subagent if Codex isn't installed). It reads your plan, finds what the review missed — logical gaps, unstated assumptions, feasibility risks — and presents findings verbatim. Optional, recommended, never blocks shipping./ship now shows whether an outside voice ran on the plan, alongside the existing CEO/Eng/Design/Adversarial review rows./plan-eng-review Codex integration upgraded. The old hardcoded Step 0.5 is replaced with a richer resolver that adds Claude subagent fallback, review log persistence, dashboard visibility, and higher reasoning effort (xhigh)./plan-ceo-review or /plan-eng-review offer to run /office-hours first, it now runs inline in the same conversation. The review picks up right where it left off after the design doc is ready. Same for mid-session detection when you're still figuring out what to build.gstack-review-read and gstack-review-log no longer crash under bash. These scripts used source <(gstack-slug) which silently fails to set variables under bash with set -euo pipefail, causing SLUG: unbound variable errors. Replaced with eval "$(gstack-slug)" which works correctly in both bash and zsh.source <(gstack-slug) now uses eval "$(gstack-slug)" for cross-shell compatibility. Regenerated all SKILL.md files from templates.eval "$(gstack-slug)" works under bash strict mode, and guard against source <(.*gstack-slug patterns reappearing in templates or bin scripts./office-hours, you can opt in to a Codex cold read — a completely independent AI that hasn't seen the conversation reviews your problem, answers, and premises. It steelmans your idea, identifies the most revealing thing you said, challenges one premise, and proposes a 48-hour prototype. Two different AI models seeing different things catches blind spots neither would find alone.## Cross-Model Perspective section capturing what Codex said — so the independent view is preserved for downstream reviews./plan-design-review, /design-review, and /design-consultation dispatch both Codex (OpenAI) and a fresh Claude subagent in parallel to independently evaluate your design — then synthesize findings with a litmus scorecard showing where they agree and disagree. Cross-model agreement = high confidence; disagreement = investigate./ship and /review now includes a Codex design check when frontend files change — automatic, no opt-in needed.~/.codex/skills/gstack with only the assets Codex needs — no source files exposed.~/.codex/skills/gstack, setup detects this and moves it to ~/.gstack/repos/gstack so skills aren't discovered from the source checkout..agents/skills/gstack runtime asset directory was incorrectly symlinked alongside real skills — now skipped..agents/skills/gstack inside any repo and run ./setup --host codex — skills install next to the checkout, no global ~/.codex/ needed. Generated preambles auto-detect whether to use repo-local or global paths at runtime../setup --host kiro installs skills for the Kiro agent platform, rewriting paths and symlinking runtime assets. Auto-detected by --host auto if kiro-cli is installed..agents/ is now gitignored. Generated Codex skill files are no longer committed — they're created at setup time from templates. Removes 14,000+ lines of generated output from the repo.GSTACK_DIR renamed to SOURCE_GSTACK_DIR / INSTALL_GSTACK_DIR throughout the setup script for clarity about which path points to the source repo vs the install location..agents/ is no longer committed).GSTACK REVIEW REPORT section — even if you haven't run any formal reviews yet. Previously, this section only appeared after running /plan-eng-review, /plan-ceo-review, /plan-design-review, or /codex review. Now you always know where you stand: which reviews have run, which haven't, and what to do next./retro global — see everything you shipped across every project in one report. Scans your Claude Code, Codex CLI, and Gemini CLI sessions, traces each back to its git repo, deduplicates by remote, then runs a full retro across all of them. Global shipping streak, context-switching metrics, per-project breakdowns with personal contributions, and cross-tool usage patterns. Run /retro global 14d for a two-week view.gstack-global-discover — the engine behind global retro. Standalone discovery script that finds all AI coding sessions on your machine, resolves working directories to git repos, normalizes SSH/HTTPS remotes for dedup, and outputs structured JSON. Compiled binary ships with gstack — no bun runtime needed.1w, 2w) are now midnight-aligned like day windows, so /retro global 1w and /retro global 7d produce consistent results./cso — your Chief Security Officer. Full codebase security audit: OWASP Top 10, STRIDE threat modeling, attack surface mapping, data classification, and dependency scanning. Each finding includes severity, confidence score, a concrete exploit scenario, and remediation options. Not a linter — a threat model.browse storage now redacts secrets automatically. Tokens, JWTs, API keys, GitHub PATs, and Bearer tokens are detected by both key name and value prefix. You see [REDACTED — 42 chars] instead of the secret.browse goto now covers all three major cloud providers (AWS, GCP, Azure).gstack-slug hardened against shell injection. Output sanitized to alphanumeric, dot, dash, and underscore only. All remaining eval $(gstack-slug) callers migrated to source <(...).browse goto now resolves hostnames to IPs and checks against the metadata blocklist — prevents attacks where a domain initially resolves to a safe IP, then switches to a cloud metadata endpoint.keyboardShortcuts or monkeyPatch). Value detection expanded to cover AWS, Stripe, Anthropic, Google, Sendgrid, and Supabase key prefixes./autoplan now produces full-depth reviews instead of compressing everything to one-liners. When autoplan said "auto-decide," it meant "decide FOR the user using principles" — but the agent interpreted it as "skip the analysis entirely." Now autoplan explicitly defines the contract: auto-decide replaces your judgment, not the analysis. Every review section still gets read, diagrammed, and evaluated. You get the same depth as running each review manually./plan-eng-review, /ship, and /review via a single {{TEST_COVERAGE_AUDIT}} resolver. Plan mode adds missing tests to your plan before you write code. Ship mode auto-generates tests for gaps. Review mode finds untested paths during pre-landing review. One methodology, three contexts, zero copy-paste./review Step 4.75 — test coverage diagram. Before landing code, /review now traces every changed codepath and produces an ASCII coverage map showing what's tested (★★★/★★/★) and what's not (GAP). Gaps become INFORMATIONAL findings that follow the Fix-First flow — you can generate the missing tests right there./ship failure triage. When tests fail during ship, the coverage audit classifies each failure and recommends next steps instead of just dumping the error output.origin remote. The gstack-repo-mode helper now gracefully handles missing remotes, bare repos, and empty git output — defaulting to unknown mode instead of crashing the preamble.REPO_MODE defaults correctly when the helper emits nothing. Previously an empty response from gstack-repo-mode left REPO_MODE unset, causing downstream template errors./autoplan — one command, fully reviewed plan. Hand it a rough plan and it runs the full CEO → design → eng review pipeline automatically. Reads the actual review skill files from disk (same depth, same rigor as running each review manually) and makes intermediate decisions using 6 encoded principles: completeness, boil lakes, pragmatic, DRY, explicit over clever, bias toward action. Taste decisions (close approaches, borderline scope, codex disagreements) surface at a final approval gate. You approve, override, interrogate, or revise. Saves a restore point so you can re-run from scratch. Writes review logs compatible with /ship's dashboard./land-and-deploy — merge, deploy, and verify in one command. Takes over where /ship left off. Merges the PR, waits for CI and deploy workflows, then runs canary verification on your production URL. Auto-detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions). Offers revert at every failure point. One command from "PR approved" to "verified in production."/canary — post-deploy monitoring loop. Watches your live app for console errors, performance regressions, and page failures using the browse daemon. Takes periodic screenshots, compares against pre-deploy baselines, and alerts on anomalies. Run /canary https://myapp.com --duration 10m after any deploy./benchmark — performance regression detection. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Catches the bundle size regressions that code review misses./setup-deploy — one-time deploy configuration. Detects your deploy platform, production URL, health check endpoints, and deploy status commands. Writes the config to CLAUDE.md so all future /land-and-deploy runs are fully automatic./review now includes Performance & Bundle Impact analysis. The informational review pass checks for heavy dependencies, missing lazy loading, synchronous script tags, and bundle size regressions. Catches moment.js-instead-of-date-fns before it ships.--retry 2 on all E2E tests. Flaky tests get a second chance without masking real failures.test:e2e:fast tier. Excludes the 8 slowest Opus quality tests for quick feedback (~5-7 minutes). Run bun run test:e2e:fast for rapid iteration.first_response_ms, max_inter_turn_ms, and model used. Wall-clock timing shows whether parallelism is actually working.plan-design-review-plan-mode no longer races. Each test gets its own isolated tmpdir — no more concurrent tests polluting each other's working directory.ship-local-workflow no longer wastes 6 of 15 turns. Ship workflow steps are inlined in the test prompt instead of having the agent read the 700+ line SKILL.md at runtime.design-consultation-core no longer fails on synonym sections. "Colors" matches "Color", "Type System" matches "Typography" — fuzzy synonym-based matching with all 7 sections still required./plan-ceo-review, /plan-eng-review, /plan-design-review, /codex review), a markdown table is appended to the plan file itself — showing each review's trigger command, purpose, run count, status, and findings summary. Anyone reading the plan can see review status at a glance without checking conversation history./retro now surfaces these insights so you can see where your projects zigged while others zagged./office-hours adds Landscape Awareness phase. After understanding your problem through questioning but before challenging premises, gstack searches for what the world thinks — then runs a three-layer synthesis to find where conventional wisdom might be wrong for your specific case./plan-eng-review adds search check. Step 0 now verifies architectural patterns against current best practices and flags custom solutions where built-ins exist./investigate searches on hypothesis failure. When your first debugging hypothesis is wrong, gstack searches for the exact error message and known framework issues before guessing again./design-consultation three-layer synthesis. Competitive research now uses the structured Layer 1/2/3 framework to find where your product should deliberately break from category norms./office-hours. When /plan-ceo-review suggests running /office-hours first, it now saves a handoff note with your system audit findings and any discussion so far. When you come back and re-invoke /plan-ceo-review, it picks up that context automatically — no more starting from scratch./retro no longer nags about PR size. The retro still reports PR size distribution (Small/Medium/Large/XL) as neutral data, but no longer flags XL PRs as problems or recommends splitting them. AI reviews don't fatigue — the unit of work is the feature, not the diff./ship and /review. No more "want a second opinion?" prompt every time — Codex reviews both your code (with a pass/fail gate) and runs an adversarial challenge by default. First-time users get a one-time opt-in prompt; after that, it's hands-free. Configure with gstack-config set codex_reviews enabled|disabled.xhigh reasoning effort — when an AI is reviewing your code, you want it thinking as hard as possible..agents/skills/), the Codex review step is completely stripped — no accidental infinite loops./tmp paths and Unix-style path separators now use platform-aware equivalents via a new platform.ts module. Path traversal protection works correctly with Windows backslash separators.Bun.serve(), Bun.spawn(), Bun.spawnSync(), and Bun.sleep() equivalents. Fully tested.browse/scripts/build-node-server.sh transpiles the server for Node.js, stubs bun:sqlite, and injects the polyfill — all automated during bun run build.gemini -p). The gemini-discover-skill test confirms skill discovery from .agents/skills/, and gemini-review-findings runs a full code review via gstack-review. Both parse Gemini's stream-json NDJSON output and track token usage.parseGeminiJSONL handles all Gemini event types (init, message, tool_use, tool_result, result) with defensive parsing for malformed input. The parser is a pure function, independently testable without spawning the CLI.bun run test:gemini and bun run test:gemini:all scripts for running Gemini E2E tests independently. Gemini tests are also included in test:evals and test:e2e aggregate scripts./office-hours, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review./office-hours now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it./plan-ceo-review and /plan-eng-review detect when you'd benefit from running /office-hours first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first.~/.gstack/analytics/spec-review.jsonl. Over time, you can see if your design docs are getting better./plan-ceo-review, /plan-eng-review, or /plan-design-review in plan mode, the review result wasn't saved to disk — so the dashboard showed stale or missing entries even though you just completed a review. Same issue affected telemetry logging at the end of every skill. Both now work reliably in plan mode.gstack now works on any AI agent that supports the open SKILL.md standard. Install once, use from Claude Code, OpenAI Codex CLI, Google Gemini CLI, or Cursor. All 21 skills are available in .agents/skills/ -- just run ./setup --host codex or ./setup --host auto and your agent discovers them automatically.
.claude/skills/, everything else reads from .agents/skills/. Same skills, same prompts, adapted for each host. Hook-based safety skills (careful, freeze, guard) get inline safety advisory prose instead of hooks -- they work everywhere../setup --host auto detects which agents you have installed and sets up both. Already have Claude Code? It still works exactly the same.~/.claude/ to ~/.codex/. The /codex skill itself is excluded from Codex output -- it's a Claude wrapper around codex exec, which would be self-referential.gstack-analytics to see a personal usage dashboard — which skills you use most, how long they take, your success rate. All data stays local on your machine.gstack-config set telemetry off.gstack-community-dashboard to see what the gstack community is building — most popular skills, crash clusters, version distribution. All powered by Supabase./retro now counts full calendar days. Running a retro late at night no longer silently misses commits from earlier in the day. Git treats bare dates like --since="2026-03-11" as "11pm on March 11" if you run it at 11pm — now we pass --since="2026-03-11T00:00:00" so it always starts from midnight. Compare mode windows get the same fix./. Branch names like garrytan/design-system caused review log writes to fail because Claude Code runs multi-line bash blocks as separate shell invocations, losing variables between commands. New gstack-review-log and gstack-review-read atomic helpers encapsulate the entire operation in a single command.bin/test-lane, RAILS_ENV, .includes(), rescue StandardError, etc.) from /ship, /review, /plan-ceo-review, and /plan-eng-review. The review checklist now shows examples for Rails, Node, Python, and Django side-by-side./ship reads CLAUDE.md to discover test commands instead of hardcoding bin/test-lane and npm run test. If no test commands are found, it asks the user and persists the answer to CLAUDE.md.## Testing section in CLAUDE.md for /ship test command discovery./ship now automatically syncs your docs. After creating the PR, /ship runs /document-release as Step 8.5 — README, ARCHITECTURE, CONTRIBUTING, and CLAUDE.md all stay current without an extra command. No more stale docs after shipping./codex (multi-AI second opinion), /careful (destructive command warnings), /freeze (directory-scoped edit lock), /guard (full safety mode), /unfreeze, and /gstack-upgrade. The sprint skill table keeps its 15 specialists; a new "Power tools" section covers the rest.$B handoff and $B resume for CAPTCHA/MFA/auth walls./codex, /careful, /freeze, /guard, /unfreeze, and /gstack-upgrade at the right workflow stages./plan-ceo-review, /plan-eng-review, or /plan-design-review, you get a recommendation for what to run next — eng review is always suggested as the required shipping gate, design review is suggested when UI changes are detected, and CEO review is softly mentioned for big product changes. No more remembering the workflow yourself.skip_eng_review respected everywhere. If you've opted out of eng review globally, the chaining recommendations won't nag you about it./review and /ship gets the same staleness tracking as full reviews.goto, diff, and newtab now block file://, javascript:, data: schemes and cloud metadata endpoints (169.254.169.254, metadata.google.internal). Localhost and private IPs are still allowed for local QA testing. (Closes #17)./setup without bun installed now shows a clear error with install instructions instead of a cryptic "command not found." (Closes #147)/debug renamed to /investigate. Claude Code has a built-in /debug command that shadowed the gstack skill. The systematic root-cause debugging workflow now lives at /investigate. (Closes #190)[a-zA-Z0-9._-] only, making both eval and source callers safe. (Closes #133)$B handoff "reason" and a visible Chrome opens at the exact same page with all your cookies and tabs intact. Solve the problem, tell Claude you're done, and $B resume picks up right where you left off with a fresh snapshot.handoff — so you don't waste time watching the AI retry a CAPTCHA.recreateContext() refactored to use shared saveState()/restoreState() helpers — same behavior, less code, ready for future state persistence features.browser.close() now has a 5-second timeout to prevent hangs when closing headed browsers on macOS./qa no longer refuses to use the browser on backend-only changes. Previously, if your branch only changed prompt templates, config files, or service logic, /qa would analyze the diff, conclude "no UI to test," and suggest running evals instead. Now it always opens the browser -- falling back to a Quick mode smoke test (homepage + top 5 navigation targets) when no specific pages are identified from the diff./codex — get an independent second opinion from a completely different AI.
Three modes. /codex review runs OpenAI's Codex CLI against your diff and gives a pass/fail gate — if Codex finds critical issues ([P1]), it fails. /codex challenge goes adversarial: it tries to find ways your code will fail in production, thinking like an attacker and a chaos engineer. /codex <anything> opens a conversation with Codex about your codebase, with session continuity so follow-ups remember context.
When both /review (Claude) and /codex review have run, you get a cross-model analysis showing which findings overlap and which are unique to each AI — building intuition for when to trust which system.
Integrated everywhere. After /review finishes, it offers a Codex second opinion. During /ship, you can run Codex review as an optional gate before pushing. In /plan-eng-review, Codex can independently critique your plan before the engineering review begins. All Codex results show up in the Review Readiness Dashboard.
Also in this release: Proactive skill suggestions — gstack now notices what stage of development you're in and suggests the right skill. Don't like it? Say "stop suggesting" and it remembers across sessions.
/qa and /design-review now ask what to do with uncommitted changes instead of refusing to start. When your working tree is dirty, you get an interactive prompt with three options: commit your changes, stash them, or abort. No more cryptic "ERROR: Working tree is dirty" followed by a wall of text./careful will warn you before any destructive command — rm -rf, DROP TABLE, force-push, kubectl delete, and more. You can override every warning. Common build artifact cleanups (rm -rf node_modules, dist, .next) are whitelisted./freeze. Debugging something and don't want Claude to "fix" unrelated code? /freeze blocks all file edits outside a directory you choose. Hard block, not just a warning. Run /unfreeze to remove the restriction without ending your session./guard activates both at once. One command for maximum safety when touching prod or live systems — destructive command warnings plus directory-scoped edit restrictions./debug now auto-freezes edits to the module being debugged. After forming a root cause hypothesis, /debug locks edits to the narrowest affected directory. No more accidental "fixes" to unrelated code during debugging.~/.gstack/analytics/skill-usage.jsonl. Run bun run analytics to see your top skills, per-repo breakdown, and how often safety hooks actually catch something. Data stays on your machine./retro shows which skills you used during the retro window alongside your usual commit analysis and metrics./retro date ranges now align to midnight instead of the current time. Running /retro at 9pm no longer silently drops the morning of the start date — you get full calendar days./retro timestamps now use your local timezone instead of hardcoded Pacific time. Users outside the US-West coast get correct local hours in histograms, session detection, and streak tracking./office-hours. Something's broken? It suggests /debug. Ready to deploy? It suggests /ship. Every workflow skill now has proactive triggers that fire when the moment is right./debug and /office-hours were completely invisible to natural language — no trigger phrases at all. Now both have full reactive + proactive triggers./office-hours — sit down with a YC partner before you write a line of code.
Two modes. If you're building a startup, you get six forcing questions distilled from how YC evaluates products: demand reality, status quo, desperate specificity, narrowest wedge, observation & surprise, and future-fit. If you're hacking on a side project, learning to code, or at a hackathon, you get an enthusiastic brainstorming partner who helps you find the coolest version of your idea.
Both modes write a design doc that feeds directly into /plan-ceo-review and /plan-eng-review. After the session, the skill reflects back what it noticed about how you think — specific observations, not generic praise.
/debug — find the root cause, not the symptom.
When something is broken and you don't know why, /debug is your systematic debugger. It follows the Iron Law: no fixes without root cause investigation first. Traces data flow, matches against known bug patterns (race conditions, nil propagation, stale cache, config drift), and tests hypotheses one at a time. If 3 fixes fail, it stops and questions the architecture instead of thrashing.
/ship, say "check my diff" and it finds /review. Following Anthropic's best practice: "the description field is not a summary — it's when to trigger."/plan-design-review is now interactive — rates 0-10, fixes the plan. Instead of producing a report with letter grades, the designer now works like CEO and Eng review: rates each design dimension 0-10, explains what a 10 looks like, then edits the plan to get there. One AskUserQuestion per design choice. The output is a better plan, not a document about the plan./plan-ceo-review detects UI scope in a plan, it activates a Design & UX section (Section 11) covering information architecture, interaction state coverage, AI slop risk, and responsive intention. For deep design work, it recommends /plan-design-review..todo). Added design-consultation to command validation./qa-design-review renamed to /design-review — the "qa-" prefix was confusing now that /plan-design-review is plan-mode. Updated across all 22 files./review and /ship apply a 20-item design checklist against changed CSS, HTML, JSX, and view files. Catches AI slop patterns (purple gradients, 3-column icon grids, generic hero copy), typography issues (body text < 16px, blacklisted fonts), accessibility gaps (outline: none), and !important abuse. Mechanical CSS fixes are auto-applied; design judgment calls ask you first.gstack-diff-scope categorizes what changed in your branch. Run source <(gstack-diff-scope main) and get SCOPE_FRONTEND=true/false, SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG. Design review uses it to skip silently on backend-only PRs. Ship pre-flight uses it to recommend design review when frontend files are touched.outline: none, !important, purple gradient, generic hero copy, 3-column feature grid). The eval verifies /review catches at least 4 of 7./plan-ceo-review applies 14 cognitive patterns from Bezos (one-way doors, Day 1 proxy skepticism), Grove (paranoid scanning), Munger (inversion), Horowitz (wartime awareness), Chesky/Graham (founder mode), and Altman (leverage obsession). /plan-eng-review applies 15 patterns from Larson (team state diagnosis), McKinley (boring by default), Brooks (essential vs accidental complexity), Beck (make the change easy), Majors (own your code in production), and Google SRE (error budgets). /plan-design-review applies 12 patterns from Rams (subtraction default), Norman (time-horizon design), Zhuo (principled taste), Gebbia (design for trust, storyboard the journey), and Ive (care is visible).bun run test:e2e, it checks your diff and skips tests whose dependencies weren't touched. A branch that only changes /retro now runs 2 tests instead of 31. Use bun run test:e2e:all to force everything.bun run eval:select previews which tests would run. See exactly which tests your diff triggers before spending API credits. Supports --json for scripting and --base <branch> to override the base branch.testName in the E2E and LLM-judge test files has a corresponding entry in the TOUCHFILES map. New tests without entries fail bun test immediately — no silent always-run degradation.test:evals and test:e2e now auto-select based on diff (was: all-or-nothing)test:evals:all and test:e2e:all scripts for explicit full runsEvery gstack skill now follows the Completeness Principle: always recommend the full implementation when AI makes the marginal cost near-zero. No more "Choose B because it's 90% of the value" when option A is 70 lines more code.
Read the philosophy: https://garryslist.org/posts/boil-the-ocean
/review now flags shortcut implementations where the
complete version costs <30 min CC time/gstack-upgrade now catches stale vendored copies automatically. If your global gstack is up to date but the vendored copy in your project is behind, /gstack-upgrade detects the mismatch and syncs it. No more manually asking "did we vendor it?" — it just tells you and offers to update../setup fails while syncing a vendored copy, gstack restores the previous version from backup instead of leaving a broken install.gstack-upgrade/SKILL.md.tmpl now references Steps 2 and 4.5 (DRY) instead of duplicating detection/sync bash blocks. Added one new version-comparison bash block.|| true)./qa fixes a bug and verifies it, Phase 8e.5 automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report. Auto-incrementing filenames prevent collisions across sessions./ship Step 3.4 builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars (★★★ = edge cases + errors, ★★ = happy path, ★ = smoke test). Gaps get tests auto-generated. PR body shows "Tests: 42 → 47 (+5 new)"./retro now shows total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area./qa-design-review Phase 8e.5 skips CSS-only fixes (those are caught by re-running the design audit) but writes tests for JavaScript behavior changes like broken dropdowns or animation failures.generateTestBootstrap() resolver to gen-skill-docs.ts (~155 lines). Registered as {{TEST_BOOTSTRAP}} in the RESOLVERS map. Inserted into qa, ship (Step 2.5), and qa-design-review templates.qa/SKILL.md.tmpl (46 lines) and CSS-aware variant to qa-design-review/SKILL.md.tmpl (12 lines). Rule 13 amended to allow creating new test files.ship/SKILL.md.tmpl (88 lines) with quality scoring rubric and ASCII diagram format.retro/SKILL.md.tmpl: 3 new data gathering commands, metrics row, narrative section, JSON schema field.qa-only/SKILL.md.tmpl gets recommendation note when no test framework detected.qa-report-template.md gains Regression Tests section with deferred test specs.{{TEST_BOOTSTRAP}} and {{REVIEW_DASHBOARD}}./plan-eng-review no longer asks you to choose between "big change" and "small change" modes. Every plan gets the full interactive walkthrough (architecture, code quality, tests, performance). Scope reduction is only suggested when the complexity check actually triggers — not as a standing menu option./ship asks about missing reviews and you say "ship anyway" or "not relevant," that decision is saved for the branch. No more getting re-asked every time you re-run /ship after a pre-landing fix.plan-eng-review/SKILL.md.tmpl. Scope reduction is now proactive (triggered by complexity check) rather than a menu item.ship/SKILL.md.tmpl — writes ship-review-override entries to $BRANCH-reviews.jsonl so subsequent /ship runs skip the gate.You're always in control — even when dreaming big. /plan-ceo-review now presents every scope expansion as an individual decision you opt into. EXPANSION mode recommends enthusiastically, but you say yes or no to each idea. No more "the agent went wild and added 5 features I didn't ask for."
New mode: SELECTIVE EXPANSION. Hold your current scope as the baseline, but see what else is possible. The agent surfaces expansion opportunities one by one with neutral recommendations — you cherry-pick the ones worth doing. Perfect for iterating on existing features where you want rigor but also want to be tempted by adjacent improvements.
Your CEO review visions are saved, not lost. Expansion ideas, cherry-pick decisions, and 10x visions are now persisted to ~/.gstack/projects/{repo}/ceo-plans/ as structured design documents. Stale plans get archived automatically. If a vision is exceptional, you can promote it to docs/designs/ in your repo for the team.
Smarter ship gates. /ship no longer nags you about CEO and Design reviews when they're not relevant. Eng Review is the only required gate (and you can disable even that with gstack-config set skip_eng_review true). CEO Review is recommended for big product changes; Design Review for UI work. The dashboard still shows all three — it just won't block you for the optional ones.
plan-ceo-review/SKILL.md.tmpl with cherry-pick ceremony, neutral recommendation posture, and HOLD SCOPE baseline.status: ACTIVE/ARCHIVED/PROMOTED), scope decisions table, archival flow.docs/designs promotion step after Review Log.skip_eng_review config), CEO/Design optional with agent judgment./design-consultation doesn't just propose a safe, coherent system — it explicitly breaks down SAFE CHOICES (category baseline) vs. RISKS (where your product stands out). You pick which rules to break. Every risk comes with a rationale for why it works and what it costs./plan-ceo-review, /plan-eng-review, and /plan-design-review now logs its result to a review tracker. At the end of each review, you see a Review Readiness Dashboard showing which reviews are done, when they ran, and whether they're clean — with a clear CLEARED TO SHIP or NOT READY verdict./ship checks your reviews before creating the PR. Pre-flight now reads the dashboard and asks if you want to continue when reviews are missing. Informational only — it won't block you, but you'll know what you skipped.owner-repo from git remote) is now a shared bin/gstack-slug helper. All 14 inline copies across templates replaced with source <(gstack-slug). If the format ever changes, fix it once./tmp/browse-screenshot.png paths you can't see. Works in /qa, /qa-only, /plan-design-review, /qa-design-review, /browse, and /gstack.{{REVIEW_DASHBOARD}} resolver to gen-skill-docs.ts — shared dashboard reader injected into 4 templates (3 review skills + ship).bin/gstack-slug helper (5-line bash) with unit tests. Outputs SLUG= and BRANCH= lines, sanitizes / to -./merge skill for review-gated PR merge (P2)./plan-design-review opens your site and reviews it like a senior product designer — typography, spacing, hierarchy, color, responsive, interactions, and AI slop detection. Get letter grades (A-F) per category, a dual headline "Design Score" + "AI Slop Score", and a structured first impression that doesn't pull punches./qa-design-review runs the same designer's eye audit, then iteratively fixes design issues in your source code with atomic style(design): commits and before/after screenshots. CSS-safe by default, with a stricter self-regulation heuristic tuned for styling changes.DESIGN.md baseline. Finally know how many fonts you're actually using.design-baseline.json. Next run auto-compares: per-category grade deltas, new findings, resolved findings. Watch your design score improve over time.{{DESIGN_METHODOLOGY}} resolver to gen-skill-docs.ts — shared design audit methodology injected into both /plan-design-review and /qa-design-review templates, following the {{QA_METHODOLOGY}} pattern.~/.gstack-dev/plans/ as a local plans directory for long-range vision docs (not checked in). CLAUDE.md and TODOS.md updated./setup-design-md to TODOS.md (P2) for interactive DESIGN.md creation from scratch./review and /ship used to print informational findings (dead code, test gaps, N+1 queries) and then ignore them. Now every finding gets action: obvious mechanical fixes are applied automatically, and genuinely ambiguous issues are batched into a single question instead of 8 separate prompts. You see [AUTO-FIXED] file:line Problem → what was done for each auto-fix.review/checklist.md) so both /review and /ship stay in sync.$B js "const x = await fetch(...); return x.status" now works. The js command used to wrap everything as an expression — so const, semicolons, and multi-line code all broke. It now detects statements and uses a block wrapper, just like eval already did.@e3 [option] "Admin" in a snapshot and runs click @e3, gstack now auto-selects that option instead of hanging on an impossible Playwright click. The right thing just happens.<option> via CSS selector used to time out with a cryptic Playwright error. Now you get: "Use 'browse select' instead of 'click' for dropdown options."review/checklist.md — the canonical AUTO-FIX vs ASK classification.Fix-First Heuristic exists in checklist and is referenced by review + ship.needsBlockWrapper() and wrapForEvaluate() helpers in read-commands.ts — shared by both js and eval commands (DRY).getRefRole() to BrowserManager — exposes ARIA role for ref selectors without changing resolveRef return type.[role=option] refs to selectOption() via parent <select>, with DOM tagName check to avoid blocking custom listbox components./gstack-upgrade always checks for real. Running /gstack-upgrade directly now bypasses the cache and does a fresh check against GitHub. No more "you're already on the latest" when you're not.last-update-check cache TTL: 60 min for UP_TO_DATE, 720 min for UPGRADE_AVAILABLE.--force flag to bin/gstack-update-check (deletes cache file before checking).--force busts UP_TO_DATE cache, --force busts UPGRADE_AVAILABLE cache, 60-min TTL boundary test with utimesSync./document-release skill. Run it after /ship but before merging — it reads every doc file in your project, cross-references the diff, and updates README, ARCHITECTURE, CONTRIBUTING, CHANGELOG, and TODOS to match what you actually shipped. Risky changes get surfaced as questions; everything else is automatic._SESSIONS >= 3 conditional._BRANCH detection to preamble bash block (git branch --show-current with fallback).$B js "await fetch(...)" now just works. Any await expression in $B js or $B eval is automatically wrapped in an async context. No more SyntaxError: await is only valid in async functions. Single-line eval files return values directly; multi-line files use explicit return./ship, /review, /qa, and /plan-ceo-review detect which branch your PR actually targets instead of assuming main. Stacked branches, Conductor workspaces targeting feature branches, and repos using master all just work now./retro works on any default branch. Repos using master, develop, or other default branch names are detected automatically — no more empty retros because the branch name was wrong.{{BASE_BRANCH_DETECT}} placeholder for skill authors — drop it into any template and get 3-step branch detection (PR base → repo default → fallback) for free.hasAwait() helper with comment-stripping to avoid false positives on // await in eval files.(...), multi-line → block {...} with explicit return..tmpl files for git commands with hardcoded main.REPORT_DIR shell variable, simplified port detection to prose.gstack-config set gstack_contributor true) and gstack automatically writes up what went wrong — what you were doing, what broke, repro steps. Next time something annoys you, the bug report is already written. Fork gstack and fix it yourself.{{UPDATE_CHECK}} to {{PREAMBLE}} across all 11 skill templates — one startup block now handles update check, session tracking, contributor mode, and question formatting./qa-only) — report-only QA mode that finds and documents bugs without making fixes. Hand off a clean bug report to your team without the agent touching your code./qa now runs a find-fix-verify cycle: discover bugs, fix them, commit, re-navigate to confirm the fix took. One command to go from broken to shipped./plan-eng-review writes test-plan artifacts that /qa picks up automatically. Your engineering review now feeds directly into QA testing with no manual copy-paste.{{QA_METHODOLOGY}} DRY placeholder — shared QA methodology block injected into both /qa and /qa-only templates. Keeps both skills in sync when you update testing standards.generateCommentary() engine — interprets comparison deltas so you don't have to: flags regressions, notes improvements, and produces an overall efficiency summary.bun run eval:list now shows Turns and Duration per run. Spot expensive or slow runs instantly.bun run eval:summary shows average turns/duration/cost per test across runs. Identify which tests are costing you the most over time.judgePassed() unit tests — extracted and tested the pass/fail judgment logic.resolveRef() now checks element count to detect stale refs after page mutations. SPA navigation no longer causes 30-second timeouts on missing elements.formatComparison() now shows per-test turns and duration deltas alongside cost.printSummary() shows turns and duration columns.eval-store.test.ts fixed pre-existing _partial file assertion bug.bin/gstack-config CLI — simple get/set/list interface for ~/.gstack/config.yaml. Used by update-check and upgrade skill for persistent settings (auto_upgrade, update_check).update_check: false config option to disable checks entirely. Snooze resets when a new version is released.auto_upgrade: true in config or GSTACK_AUTO_UPGRADE=1 env var to skip the upgrade prompt and update automatically./gstack-upgrade now detects and updates local vendored copies in the current project after upgrading the primary install./gstack-upgrade instead of long paste commands.Write tool permission for config editing.TODO.md (roadmap) and TODOS.md (near-term) into one file organized by skill/component with P0-P4 priority ordering and a Completed section./ship Step 5.5: TODOS.md management — auto-detects completed items from the diff, marks them done with version annotations, offers to create/reorganize TODOS.md if missing or unstructured./plan-ceo-review, /plan-eng-review, /retro, /review, and /qa now read TODOS.md for project context. /retro adds Backlog Health metric (open counts, P0/P1 items, churn).review/TODOS-format.md — canonical TODO item format referenced by /ship and /plan-ceo-review to prevent format drift (DRY).greptile-triage.md for fixes (inline diff), already-fixed (what was done), and false positives (evidence + suggested re-rank). Replaces vague one-line replies.**Suggested re-rank:** when Greptile miscategorizes issue severity.TODOS-format.md references across skills..gitignore append failures silently swallowed — ensureStateDir() bare catch {} replaced with ENOENT-only silence; non-ENOENT errors (EACCES, ENOSPC) logged to .gstack/browse-server.log.TODO.md deleted — all items merged into TODOS.md./ship Step 3.75 and /review Step 5 now reference reply templates and escalation detection from greptile-triage.md./ship Step 6 commit ordering includes TODOS.md in the final commit alongside VERSION + CHANGELOG./ship Step 8 PR body includes TODOS section.screenshot command now supports element crop via CSS selector or @ref (screenshot "#hero" out.png, screenshot @e3 out.png), region clip (screenshot --clip x,y,w,h out.png), and viewport-only mode (screenshot --viewport out.png). Uses Playwright's native locator.screenshot() and page.screenshot({ clip }). Full page remains the default.~/.gstack-dev/e2e-live.json), per-run log directory (~/.gstack-dev/e2e-runs/{runId}/), progress.log, per-test NDJSON transcripts, persistent failure transcripts. All I/O non-fatal.bun run eval:watch — live terminal dashboard reads heartbeat + partial eval file every 1s. Shows completed tests, current test with turn/tool info, stale detection (>10min), --tail for progress.log.savePartial() writes _partial-e2e.json after each test completes. Crash-resilient: partial results survive killed runs. Never cleaned up.exit_reason, timeout_at_turn, last_tool_call fields in eval JSON. Enables jq queries for automated fix loops.is_error detection — claude -p can return subtype: "success" with is_error: true on API failures. Now correctly classified as error_api.parseNDJSON() pure function for real-time E2E progress from claude -p --output-format stream-json --verbose.~/.gstack-dev/evals/ with auto-comparison against previous run.eval:list, eval:compare, eval:summary for inspecting eval history..tmpl templates — plan-ceo-review, plan-eng-review, retro, review, ship now use {{UPDATE_CHECK}} placeholder. Single source of truth for update check preamble.claude -p (~$3.85/run), Tier 3: LLM-as-judge (~$0.15/run). Gated by EVALS=1.test/helpers/skill-parser.ts — getRemoteSlug() for git remote detection.find-browse indirection with explicit browse/dist/browse path in SKILL.md setup blocks.|| true to prevent non-zero exit when no update available.{{BROWSE_SETUP}} placeholder.{{UPDATE_CHECK}} and {{BROWSE_SETUP}} placeholders in gen-skill-docs.ts. All browse-using skills generate from single source of truth.generateHelpText() auto-generated from COMMAND_DESCRIPTIONS (replaces hand-maintained help text)..tmpl files with {{COMMAND_REFERENCE}} and {{SNAPSHOT_FLAGS}} placeholders, auto-generated from source code at build time. Structurally prevents command drift between docs and code.browse/src/commands.ts) — single source of truth for all browse commands with categories and enriched descriptions. Zero side effects, safe to import from build scripts and tests.SNAPSHOT_FLAGS array in browse/src/snapshot.ts) — metadata-driven parser replaces hand-coded switch/case. Adding a flag in one place updates the parser, docs, and tests.$B commands from SKILL.md code blocks, validates against command registry and snapshot flag metadataSKILL_E2E=1 env var (~$0.50/run)ANTHROPIC_API_KEYbun run skill:check — health dashboard showing all skills, command counts, validation status, template freshnessbun run dev:skill — watch mode that regenerates and validates SKILL.md on every template or source file change.github/workflows/skill-docs.yml) — runs gen:skill-docs on push/PR, fails if generated output differs from committed filesbun run gen:skill-docs script for manual regenerationbun run test:eval for LLM-as-judge evalstest/helpers/skill-parser.ts — extracts and validates $B commands from Markdowntest/helpers/session-runner.ts — Agent SDK wrapper with error pattern scanning and transcript savingconductor.json) — lifecycle hooks for workspace setup/teardown.env propagation — bin/dev-setup copies .env from main worktree into Conductor workspaces automatically.env.example template for API key configurationgen:skill-docs before compiling binariesparseSnapshotArgs is metadata-driven (iterates SNAPSHOT_FLAGS instead of switch/case)server.ts imports command sets from commands.ts instead of declaring inline.tmpl instead)jsonResponse() referenced url out of scope, crashing every API callhelp command routed correctly (was unreachable due to META_COMMANDS dispatch ordering)~/.claude/skills/gstack fallback from resolveServerScript()/tmp/ to .gstack//qa on a feature branch auto-analyzes git diff, identifies affected pages/routes, detects the running app on localhost, and tests only what changed. No URL needed..gstack/ inside the project root (detected via git rev-parse --show-toplevel). No more /tmp state files.browse/src/config.ts) — centralizes path resolution for CLI and server, eliminates duplicated port/state logicbinaryVersion SHA; CLI auto-restarts the server when the binary is rebuilt/tmp/browse-server*.json files, verifying PID ownership before sending signals/review and /ship fetch and triage Greptile bot comments; /retro tracks Greptile batting average across weeksbin/dev-setup symlinks skills from the repo for in-place development; bin/dev-teardown restores global installhelp command — agents can self-discover all commands and snapshot flagsfind-browse with META signal protocol — detects stale binaries and prompts agents to updatebrowse/dist/find-browse compiled binary with git SHA comparison against origin/main (4hr cached).version file written at build time for binary version tracking.gstack/browse.json (was /tmp/browse-server.json).gstack/browse-{console,network,dialog}.log (was /tmp/browse-*.log).json.tmp → rename (prevents partial reads)BROWSE_STATE_FILE to spawned server (server derives all paths from it)META:UPDATE_AVAILABLE/qa SKILL.md now describes four modes (diff-aware, full, quick, regression) with diff-aware as the default on feature branchesjsonResponse/errorResponse use options objects to prevent positional parameter confusionbrowse and find-browse binaries, cleans up .bun-build temp filesCONDUCTOR_PORT magic offset (browse_port = CONDUCTOR_PORT - 45600)~/.claude/skills/gstack/browse/src/server.tsDEVELOPING_GSTACK.md (renamed to CONTRIBUTING.md)cookie-import-browser command — decrypt and import cookies from real Chromium browsers (Comet, Chrome, Arc, Brave, Edge)--domain flag for non-interactive use/setup-browser-cookies skill for Claude Code integration/qa skill with 6-phase workflow (Initialize, Authenticate, Orient, Explore, Document, Wrap up)browse/bin/find-browse — DRY binary discovery using git rev-parse --show-toplevelupload <sel> <file1> [file2...]is visible|hidden|enabled|disabled|checked|editable|focused <sel>snapshot -a)snapshot -D)snapshot -C)wait --networkidle / --load / --domcontentloaded flagsconsole --errors filter (error + warning only)cookie-import <json-file> with auto-fill domain from page URL/browse installs — compiled binary now resolves server.ts from its own directory instead of assuming a global install existssetup rebuilds stale binaries (not just missing ones) and exits non-zero if the build failschain command swallowing real errors from write commands (e.g. navigation timeout reported as "Unknown meta command")ln -snf in setup to avoid creating nested symlinks on upgradegit fetch && git reset --hard instead of git pull for upgrades (handles force-pushes)/retro)Initial release.
/plan-ceo-review, /plan-eng-review, /review, /ship, /browsesetup script for binary compilation and skill symlinking