fix: Codex compatibility — 1024-char cap, duplicate skills, repo-local installs, kiro support (v0.11.2.0) (#346)
* fix: cap gstack skill descriptions for codex (#251)
Compresses SKILL.md.tmpl root description to <1024 chars (Codex token limit).
Adds description-length validation test. Includes /autoplan in compressed
skill list (added since PR was branched).
Co-authored-by: cweill <cweill@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: skip sidecar dir in Codex skill linking (#269)
Adds guard to skip .agents/skills/gstack in link_codex_skill_dirs() —
it's a runtime asset sidecar, not a standalone skill. Prevents duplicate
skill discovery and symlink overwriting.
Fixes #261
Co-authored-by: mvanhorn <mvanhorn@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: generate .agents directory at setup time instead of shipping duplicates (#308)
Removes 14K+ lines of committed generated Codex skill files from git.
.agents/ is now gitignored and generated at setup time via
`bun run gen:skill-docs --host codex`. Updates CI workflow to validate
generation instead of checking committed file freshness.
Co-authored-by: cskwork <cskwork@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: avoid duplicate Codex skill discovery (#236)
Adds migrate_direct_codex_install() to move old direct installs from
~/.codex/skills/gstack to ~/.gstack/repos/gstack. Adds
create_codex_runtime_root() to expose only runtime assets (bin/, browse/,
review files) via symlinks instead of symlinking the entire repo.
Fixes #235
Co-authored-by: shichangs <shichangs@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: support repo-local Codex installs (#317)
Changes gen-skill-docs.ts to use dynamic $GSTACK_ROOT/$GSTACK_BIN/$GSTACK_BROWSE
variables in generated Codex preambles instead of hardcoded ~/.codex/ paths.
Renames GSTACK_DIR → SOURCE_GSTACK_DIR/INSTALL_GSTACK_DIR throughout setup for
clarity. Supports both global (~/.codex/skills/) and repo-local (.agents/skills/)
Codex installs.
Co-authored-by: pengwk <pengwk@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add --host kiro support to setup script (#309)
Adds Kiro CLI as a supported agent platform. Setup detects kiro-cli,
copies+sed-rewrites SKILL.md paths from Codex/Claude to Kiro format,
and symlinks runtime assets (bin/, browse/).
Co-authored-by: AnshulDesai <AnshulDesai@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add sidecar skip, GSTACK_ROOT, and kiro coverage (T1-T3)
Adds 3 tests identified during CEO/Eng review:
- T1: link_codex_skill_dirs() contains sidecar skip guard
- T2: generated Codex preambles use dynamic $GSTACK_ROOT paths
- T3: setup supports --host kiro with INSTALL_KIRO and sed rewrites
Also fixes existing test to expect kiro in --host case statement.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: review fixes — ETHOS.md, runtime root, repo-local guard, kiro assets, upgrade paths
Paranoid 4-pass review found 7 issues, all fixed:
- Add ETHOS.md to create_codex_runtime_root
- Clean old real dirs (not just symlinks) on upgrade
- Skip runtime root for repo-local installs (prevent self-referential symlinks)
- Add review/, ETHOS.md, gstack-upgrade/ to Kiro install
- Update gstack-upgrade to detect ~/.gstack/repos/ and .agents/skills/
- Guard --host without value from silent exit
- Fix Kiro sed patterns + timeout instruction in gen-skill-docs.ts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.11.2.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove last tracked .agents/ file from git index
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: cweill <cweill@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: mvanhorn <mvanhorn@users.noreply.github.com>
Co-authored-by: cskwork <cskwork@users.noreply.github.com>
Co-authored-by: shichangs <shichangs@users.noreply.github.com>
Co-authored-by: pengwk <pengwk@users.noreply.github.com>
Co-authored-by: AnshulDesai <AnshulDesai@users.noreply.github.com>
feat: plan files always show review status (v0.11.1.1) (#345)
* feat: plan files always show review status via preamble footer
Add Plan Status Footer to generateCompletionStatus() in the preamble.
When in plan mode before ExitPlanMode, Claude writes a GSTACK REVIEW
REPORT section to the plan file — either populated from review logs
or a "NO REVIEWS YET" placeholder. Skips if a review skill already
wrote a richer report.
* chore: bump version and changelog (v0.11.1.1)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: test coverage catalog — shared audit across plan/ship/review (v0.10.1.0) (#259)
* refactor: extract {{TEST_COVERAGE_AUDIT}} shared resolver
DRY extraction of the test coverage audit methodology into a shared
generator function with three explicit placeholders:
- TEST_COVERAGE_AUDIT_PLAN (plan-eng-review)
- TEST_COVERAGE_AUDIT_SHIP (ship)
- TEST_COVERAGE_AUDIT_REVIEW (review)
Shared across all modes: codepath tracing, ASCII diagram format,
quality scoring rubric, E2E test decision matrix, regression rule,
and test framework detection via CLAUDE.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: plan-eng-review uses shared test coverage audit
Replace the thin 6-line Section 3 test review with the full shared
methodology via {{TEST_COVERAGE_AUDIT_PLAN}}. Plan mode now:
- Traces every codepath with full ASCII diagrams
- Adds missing tests to the plan (not just "check for tests")
- Writes test plan artifact for /qa consumption
- Includes E2E/eval recommendations and regression detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: ship uses shared test coverage audit
Replace 135 lines of inline Step 3.4 methodology with
{{TEST_COVERAGE_AUDIT_SHIP}}. Functionally identical output plus:
- E2E test decision matrix (marks paths needing E2E vs unit)
- Eval recommendations for LLM prompt changes
- Regression detection iron rule
- Test framework detection via CLAUDE.md first
- Test plan artifact for /qa consumption
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /review Step 4.75 test coverage diagram
Add codepath tracing to the pre-landing review via
{{TEST_COVERAGE_AUDIT_REVIEW}}. Review mode:
- Produces ASCII coverage diagram (same methodology as plan/ship)
- Generates tests for gaps via Fix-First (ASK user)
- Subsumes Pass 2 "Test Gaps" checklist category
- Gaps are INFORMATIONAL findings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: mode differentiation + regression guard for coverage audit
10 new tests verifying the three TEST_COVERAGE_AUDIT placeholders:
- All modes share: codepath tracing, E2E matrix, regression rule
- Plan mode: adds to plan + artifact, no ship-specific content
- Ship mode: auto-generates + before/after count + coverage summary
- Review mode: Fix-First ASK + INFORMATIONAL, no artifact
- Regression guard: ship SKILL.md preserves all key phrases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: extract shared coverage audit fixture + review E2E
- Extract billing.ts fixture into coverage-audit-fixture.ts (DRY)
- Refactor ship-coverage-audit E2E to use shared fixture
- Add review-coverage-audit E2E for Step 4.75
- Update touchfiles: both E2Es depend on shared fixture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: strengthen E2E assertions for coverage audit tests
The coverage audit E2E tests (ship + review) were only asserting
exitReason === 'success' and readCalls > 0 — they passed even
if the agent produced no coverage diagram. Add assertion that
the output contains either GAP or TESTED markers.
Found during /review.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: plan mode traces the plan, not the git diff
Codex adversarial review caught that plan-eng-review was inheriting
"git diff origin/<base>...HEAD" from the shared resolver, but plan mode
reviews a plan document, not a code diff. Plan mode now says:
"Trace every codepath in the plan" and "Read the plan document."
Ship and review modes keep the git diff instruction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: test coverage catalog + failure triage (merged branches) (#285)
* feat: add bin/gstack-repo-mode — solo vs collaborative detection with caching
Detects whether a repo is solo-dev (one person does 80%+ of recent commits)
or collaborative. Uses 90-day git shortlog window with 7-day cache in
~/.gstack/projects/{SLUG}/repo-mode.json. Config override via
`gstack-config set repo_mode solo|collaborative` takes precedence over
the heuristic. Minimum 5 commits required to classify (otherwise unknown).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: test failure ownership triage — see something say something
Adds two new preamble sections to all gstack skills:
- Repo Ownership Mode: explains solo vs collaborative behavior
- See Something, Say Something: proactive issue flagging principle
Adds {{TEST_FAILURE_TRIAGE}} template variable (opt-in, used by /ship):
- Classifies test failures as in-branch vs pre-existing
- Solo mode defaults to "investigate and fix now"
- Collaborative mode offers "blame + assign GitHub issue" option
- Also offers P0 TODO and skip options
/ship Step 3 now triages test failures instead of hard-stopping on all
failures. In-branch failures still block shipping. Pre-existing failures
get user-directed triage based on repo mode.
Adds P2 TODO for gstack notes system (deferred lightweight reminder).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files for Claude and Codex hosts
All 22 Claude skills and 21 Codex skills regenerated with new preamble
sections (Repo Ownership Mode, See Something Say Something) and
{{TEST_FAILURE_TRIAGE}} resolved in ship/SKILL.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate repo mode values to prevent shell injection
Codex adversarial review found that unvalidated config/cache values
could be injected into shell via source <(gstack-repo-mode). Added
validate_mode() that only allows solo|collaborative|unknown — anything
else becomes "unknown". Prevents persistent code execution through
malicious config.yaml or tampered cache JSON.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: shell injection via branch names + feature-branch sampling bias
Codex code review found two issues:
P1: eval $(gstack-slug) in gstack-repo-mode executes branch names as
shell. Branch names like foo$(touch${IFS}pwned) are valid git refs and
would execute arbitrary commands. Fix: compute SLUG directly with sed
instead of eval'ing gstack-slug output.
P2: git shortlog HEAD only sees current branch history. On feature
branches that haven't merged main recently, other contributors disappear
from the sample. Fix: use git shortlog on the default branch
(origin/main) instead of HEAD.
Also improved blame lookup in collaborative triage to check both the
test file and the production code it covers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: broaden codex-host stripping test to accommodate triage section
"Investigate and fix" now appears in TEST_FAILURE_TRIAGE (not just the
Codex review step). Use CODEX_REVIEWS config string as a more specific
marker for detecting the Codex review step in Codex-hosted skills.
* fix: replace template placeholder in TODOS.md with readable text
{{TEST_FAILURE_TRIAGE}} is template syntax but TODOS.md is not processed
by gen-skill-docs — replaced with human-readable reference.
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add bin/ directory to project structure in CLAUDE.md
* test: add triage resolver unit tests, plan-eng coverage audit E2E, and triage E2E
- TEST_FAILURE_TRIAGE resolver: 6 unit tests verifying all triage steps (T1-T4),
REPO_MODE branching, and safety default for ambiguous failures
- plan-eng-coverage-audit E2E: tests /plan-eng-review coverage audit codepath
(gap identified during eng review — existed on neither branch)
- ship-triage E2E: planted-bug fixture with in-branch (truncate null) and
pre-existing (divide-by-zero) failures; verifies correct classification
- Touchfile entries for diff-based test selection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate stale Codex SKILL.md for retro
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Split `git remote get-url origin` into a separate variable with `|| true`
so the script doesn't crash under `set -euo pipefail` in local-only repos.
Falls back to REPO_MODE=unknown gracefully.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
Changed preamble from `source <(...) || REPO_MODE=unknown` (which doesn't
catch empty output) to `source <(...) || true` followed by
`REPO_MODE=${REPO_MODE:-unknown}`. Regenerated all SKILL.md files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
math.test.js called process.exit(1) which killed the runner before
string.test.js could execute. Changed test runner to use child_process
so each test runs independently and both failure classes are exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Fall back through origin/main → origin/master → HEAD when
git symbolic-ref refs/remotes/origin/HEAD is not set. Prevents
shortlog crash in repos where origin/HEAD isn't configured.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
Add assertions verifying both math.test.js (pre-existing failure) and
string.test.js (in-branch failure) actually executed during triage.
Prevents false passes where only one failure class is exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
- Remove head -20 truncation that biased solo classification by
dropping low-volume contributors from the denominator
- Use atomic write (mktemp + mv) for cache to prevent concurrent
preamble reads from seeing partial JSON
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add test coverage catalog to CHANGELOG + update project structure
- CHANGELOG: add 6 entries for coverage audit, review Step 4.75, E2E
recommendations, regression iron rule, failure triage, repo-mode fix
- CLAUDE.md: add missing skill directories (autoplan, benchmark, canary,
codex, land-and-deploy, setup-deploy) to project structure
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.10.1.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: CHANGELOG rules — branch-scoped versions, never fold into old entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: /land-and-deploy, /canary, /benchmark + perf review (v0.7.0) (#183)
* feat: add /canary, /benchmark, /land-and-deploy skills (v0.7.0)
Three new skills that close the deploy loop:
- /canary: standalone post-deploy monitoring with browse daemon
- /benchmark: performance regression detection with Web Vitals
- /land-and-deploy: merge PR, wait for deploy, canary verify production
Incorporates patterns from community PR #151.
Co-Authored-By: HMAKT99 <HMAKT99@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add Performance & Bundle Impact category to review checklist
New Pass 2 (INFORMATIONAL) category catching heavy dependencies
(moment.js, lodash full), missing lazy loading, synchronous scripts,
CSS @import blocking, fetch waterfalls, and tree-shaking breaks.
Both /review and /ship automatically pick this up via checklist.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add {{DEPLOY_BOOTSTRAP}} resolver + deployed row in dashboard
- New generateDeployBootstrap() resolver auto-detects deploy platform
(Vercel, Netlify, Fly.io, GH Actions, etc.), production URL, and
merge method. Persists to CLAUDE.md like test bootstrap.
- Review Readiness Dashboard now shows a "Deployed" row from
/land-and-deploy JSONL entries (informational, never gates shipping).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: mark 3 TODOs completed, bump v0.7.0, update CHANGELOG
Superseded by /land-and-deploy:
- /merge skill — review-gated PR merge
- Deploy-verify skill
- Post-deploy verification (ship + browse)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /setup-deploy skill + platform-specific deploy verification
- New /setup-deploy skill: interactive guided setup for deploy configuration.
Detects Fly.io, Render, Vercel, Netlify, Heroku, Railway, GitHub Actions,
and custom deploy scripts. Writes config to CLAUDE.md with custom hooks
section for non-standard setups.
- Enhanced deploy bootstrap: platform-specific URL resolution (fly.toml app
→ {app}.fly.dev, render.yaml → {service}.onrender.com, etc.), deploy
status commands (fly status, heroku releases), and custom deploy hooks
section in CLAUDE.md for manual/scripted deploys.
- Platform-specific deploy verification in /land-and-deploy Step 6:
Strategy A (GitHub Actions polling), Strategy B (platform CLI: fly/render/heroku),
Strategy C (auto-deploy: vercel/netlify), Strategy D (custom hooks from CLAUDE.md).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: E2E + LLM-judge evals for deploy skills
- 4 E2E tests: land-and-deploy (Fly.io detection + deploy report),
canary (monitoring report structure), benchmark (perf report schema),
setup-deploy (platform detection → CLAUDE.md config)
- 4 LLM-judge evals: workflow quality for all 4 new skills
- Touchfile entries for diff-based test selection (E2E + LLM-judge)
- 460 free tests pass, 0 fail
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: harden E2E tests — server lifecycle, timeouts, preamble budget, skip flaky
Cross-cutting fixes:
- Pre-seed ~/.gstack/.completeness-intro-seen and ~/.gstack/.telemetry-prompted
so preamble doesn't burn 3-7 turns on lake intro + telemetry in every test
- Each describe block creates its own test server instance instead of sharing
a global that dies between suites
Test fixes (5 tests):
- /qa quick: own server instance + preamble skip
- /review SQL injection: timeout 90→180s, maxTurns 15→20, added assertion
that review output actually mentions SQL injection
- /review design-lite: maxTurns 25→35 + preamble skip (now detects 7/7)
- ship-base-branch: both timeouts 90→150/180s + preamble skip
- plan-eng artifact: clean stale state in beforeAll, maxTurns 20→25
Skipped (4 flaky/redundant tests):
- contributor-mode: tests prompt compliance, not skill functionality
- design-consultation-research: WebSearch-dependent, redundant with core
- design-consultation-preview: redundant with core test
- /qa bootstrap: too ambitious (65 turns, installs vitest)
Also: preamble skip added to qa-only, qa-fix-loop, design-consultation-core,
and design-consultation-existing prompts. Updated touchfiles entries and
touchfiles.test.ts. Added honest comment to codex-review-findings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: redesign 6 skipped/todo E2E tests + add test.concurrent support
Redesigned tests (previously skipped/todo):
- contributor-mode: pre-fail approach, 5 turns/30s (was 10 turns/90s)
- design-consultation-research: WebSearch-only, 8 turns/90s (was 45/480s)
- design-consultation-preview: preview HTML only, 8 turns/90s (was 30/480s)
- qa-bootstrap: bootstrap-only, 12 turns/90s (was 65/420s)
- /ship workflow: local bare remote, 15 turns/120s (was test.todo)
- /setup-browser-cookies: browser detection smoke, 5 turns/45s (was test.todo)
Added testConcurrentIfSelected() helper for future parallelization.
Updated touchfiles entries for all 6 re-enabled tests.
Target: 0 skip, 0 todo, 0 fail across all E2E tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: relax contributor-mode assertions — test structure not exact phrasing
* perf: enable test.concurrent for 31 independent E2E tests
Convert 18 skill-e2e, 11 routing, and 2 codex tests from sequential
to test.concurrent. Only design-consultation tests (4) remain sequential
due to shared designDir state. Expected ~6x speedup on Teams high-burst.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add --concurrent flag to bun test + convert remaining 4 sequential tests
bun's test.concurrent only works within a describe block, not across
describe blocks. Adding --concurrent to the CLI command makes ALL tests
concurrent regardless of describe boundaries. Also converted the 4
design-consultation tests to concurrent (each already independent).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: split monolithic E2E test into 8 parallel files
Split test/skill-e2e.test.ts (3442 lines) into 8 category files:
- skill-e2e-browse.test.ts (7 tests)
- skill-e2e-review.test.ts (7 tests)
- skill-e2e-qa-bugs.test.ts (3 tests)
- skill-e2e-qa-workflow.test.ts (4 tests)
- skill-e2e-plan.test.ts (6 tests)
- skill-e2e-design.test.ts (7 tests)
- skill-e2e-workflow.test.ts (6 tests)
- skill-e2e-deploy.test.ts (4 tests)
Bun runs each file in its own worker = 10 parallel workers
(8 split + routing + codex). Expected: 78 min → ~12 min.
Extracted shared helpers to test/helpers/e2e-helpers.ts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: bump default E2E concurrency to 15
* perf: add model pinning infrastructure + rate-limit telemetry to E2E runner
Default E2E model changed from Opus to Sonnet (5x faster, 5x cheaper).
Session runner now accepts `model` option with EVALS_MODEL env var override.
Added timing telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms
to eval-store for diagnosing rate-limit impact. Added EVALS_FAST test filtering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: resolve 3 E2E test failures — tmpdir race, wasted turns, brittle assertions
plan-design-review-plan-mode: give each test its own tmpdir to eliminate
race condition where concurrent tests pollute each other's working directory.
ship-local-workflow: inline ship workflow steps in prompt instead of having
agent read 700+ line SKILL.md (was wasting 6 of 15 turns on file I/O).
design-consultation-core: replace exact section name matching with fuzzy
synonym-based matching (e.g. "Colors" matches "Color", "Type System"
matches "Typography"). All 7 sections still required, LLM judge still hard fail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: pin quality tests to Opus, add --retry 2 and test:e2e:fast tier
~10 quality-sensitive tests (planted-bug detection, design quality judge,
strategic review, retro analysis) explicitly pinned to Opus. ~30 structure
tests default to Sonnet for 5x speed improvement.
Added --retry 2 to all E2E scripts for flaky test resilience.
Added test:e2e:fast script that excludes 8 slowest tests for quick feedback.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: mark E2E model pinning TODO as shipped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add SKILL.md merge conflict directive to CLAUDE.md
When resolving merge conflicts on generated SKILL.md files, always merge
the .tmpl templates first, then regenerate — never accept either side's
generated output directly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add DEPLOY_BOOTSTRAP resolver to gen-skill-docs
The land-and-deploy template referenced {{DEPLOY_BOOTSTRAP}} but no resolver
existed, causing gen-skill-docs to fail. Added generateDeployBootstrap() that
generates the deploy config detection bash block (check CLAUDE.md for persisted
config, auto-detect platform from config files, detect deploy workflows).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files after DEPLOY_BOOTSTRAP fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: move prompt temp file outside workingDirectory to prevent race condition
The .prompt-tmp file was written inside workingDirectory, which gets deleted
by afterAll cleanup. With --concurrent --retry, afterAll can interleave with
retries, causing "No such file or directory" crashes at 0s (seen in
review-design-lite and office-hours-spec-review).
Fix: write prompt file to os.tmpdir() with a unique suffix so it survives
directory cleanup. Also convert review-design-lite from describeE2E to
describeIfSelected for proper diff-based test selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add --retry 2 --concurrent flags to test:evals scripts for consistency
test:evals and test:evals:all were missing the retry and concurrency flags
that test:e2e already had, causing inconsistent behavior between the two
script families.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: HMAKT99 <HMAKT99@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: auto-scaled adversarial review (v0.9.5.0) (#297)
* feat: auto-scaled adversarial review by diff size
Replace config-driven Codex review step with automatic adversarial review
that scales by diff size: small (<50 lines) skips adversarial, medium
(50-199) gets cross-model adversarial, large (200+) gets all 4 passes.
Adds Claude adversarial subagent as fallback when Codex unavailable.
Review log uses new "adversarial-review" skill name with source/tier fields.
Dashboard updated to read both adversarial-review and legacy codex-review.
* chore: regenerate SKILL.md files for auto-scaled adversarial review
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: allow user override of adversarial review tier
Users can now say "paranoid review", "run all passes", "full adversarial",
etc. to force the large tier regardless of diff size.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: Search Before Building — builder ethos + skill integrations (v0.9.5.0) (#298)
* feat: ETHOS.md — gstack builder philosophy
Standalone document capturing the four principles: The Golden Age,
Boil the Lake, Search Before Building, and Build for Yourself.
Introduces the three-layer knowledge framework (tried-and-true,
new-and-popular, first-principles) and the Eureka Moment concept —
when first-principles reasoning reveals conventional wisdom is wrong.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: Search Before Building preamble section + CLAUDE.md
Add generateSearchBeforeBuildingSection(ctx) to gen-skill-docs.ts.
Every workflow skill now gets a compact router section covering:
- Three layers of knowledge (tried-and-true, new-and-popular, first-principles)
- Eureka moment format and jq-based JSONL logging
- WebSearch fallback clause
- ETHOS.md reference via ctx.paths.skillRoot resolver
Also adds compact "Search before building" section to CLAUDE.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: skill-specific Search Before Building integrations
8 template changes:
- /office-hours: Phase 2.75 Landscape Awareness (WebSearch + three-layer synthesis)
- /plan-eng-review: Step 0 search check with layer provenance annotations
- /investigate: external pattern search + search escalation on hypothesis failure
- /plan-ceo-review: Landscape Check before scope challenge
- /review: search-before-recommending for fix patterns
- /qa-only: WebSearch in allowed-tools
- /design-consultation: three-layer synthesis backport in Phase 2 Step 3
- /retro: eureka moment tracking from ~/.gstack/analytics/eureka.jsonl
All search steps include WebSearch fallback clause.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: v0.9.5.0 — Builder Ethos (CHANGELOG + VERSION + TODOS)
ETHOS.md + Search Before Building across all workflow skills.
Deferred: first-time intro flow (blocked on blog post).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address Codex review — sanitize search, privacy gate, ETHOS.md sidecar
Three fixes from adversarial Codex review:
- /investigate: sanitize error messages before searching (strip hostnames,
IPs, file paths, SQL, customer data). Skip search if unsanitizable.
- /office-hours: add privacy gate before landscape search. Use generalized
category terms, never the user's specific product name or stealth idea.
- setup: link ETHOS.md into .agents/skills/gstack/ sidecar so workspace-
local Codex sessions can find the builder philosophy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: sanitize Phase 2 external pattern search in /investigate
The Phase 2 external search also sent raw error messages to WebSearch.
Apply same sanitization rule as Phase 3 search escalation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: sync documentation with shipped changes
- ARCHITECTURE.md: preamble now handles 5 things (add Search Before Building)
- CLAUDE.md: add ETHOS.md to project structure tree
- README.md: add ETHOS.md to docs table
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: default codex reviews in /ship and /review (v0.9.4.0) (#256)
* feat: default codex reviews in /ship and /review with xhigh reasoning
Codex code reviews are now opt-in-once-then-always-on via a one-time
adoption prompt. When enabled, both review + adversarial run automatically
on every /ship and /review — no more choosing between them.
Key changes:
- New {{CODEX_REVIEW_STEP}} resolver centralizes Codex review logic (DRY)
- Three-state config: enabled/not-set/disabled via gstack-config
- P1 findings default to "Investigate and fix" instead of "Ship anyway"
- All reasoning bumped to xhigh (review, adversarial, consult)
- Codex review step stripped from codex-host variants (no self-invocation)
- Ship "Never ask" rule updated to accurately list quality-gate stops
- Error handling for auth, timeout, empty response (all non-blocking)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: update touchfiles test for plan-ceo-review-benefits dependency
The merge from main added plan-ceo-review-benefits to E2E_TOUCHFILES,
which means plan-ceo-review/SKILL.md now selects 3 tests, not 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: default codex reviews in /ship and /review (v0.9.4.0)
Codex code reviews now run automatically — both review + adversarial
challenge — with a one-time opt-in prompt for new users. All modes use
xhigh reasoning. Codex-host builds strip the step to prevent recursion.
Fixes from Codex review: TMPERR properly defined, stderr captured for
both review and adversarial, error handling before log persist, commit
hash included in review log for staleness tracking.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: plan mode exception for review log + telemetry writes (v0.9.0.1) (#234)
* fix: plan mode exception for review log + telemetry writes
Add explicit plan-mode exception notes to review log sections in all
3 plan review skill templates and the telemetry section in gen-skill-docs.ts.
When Claude runs in plan mode, it self-censors bash writes — but review
logging and telemetry write to ~/.gstack/ (user metadata, not project
files). The preamble already writes to the same directory successfully.
The exception note gives Claude a reasoning chain: safety argument,
precedent, and consequence of skipping.
* chore: regenerate Codex/agents SKILL.md files with plan-mode exception
* chore: bump version and changelog (v0.9.0.1)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: community-first telemetry opt-in with anonymous fallback
Default opt-in is now "Help gstack get better!" (community mode with
stable device ID). If declined, offers anonymous mode as a softer
alternative before fully off.
* chore: regenerate SKILL.md files with community-first telemetry prompt
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: opt-in usage telemetry + community intelligence platform (v0.8.6) (#210)
* feat: add gstack-telemetry-log and gstack-analytics scripts
Local telemetry infrastructure for gstack usage tracking.
gstack-telemetry-log appends JSONL events with skill name, duration,
outcome, session ID, and platform info. Supports off/anonymous/community
privacy tiers. gstack-analytics renders a personal usage dashboard
from local data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add telemetry preamble injection + opt-in prompt + epilogue
Extends generatePreamble() with telemetry start block (config read,
timer, session ID, .pending marker), opt-in prompt (gated by
.telemetry-prompted), and epilogue instructions for Claude to log
events after skill completion. Adds 5 telemetry tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate all SKILL.md files with telemetry blocks
Automated regeneration from gen-skill-docs.ts changes. All skills
now include telemetry start block, opt-in prompt, and epilogue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add Supabase schema, edge functions, and SQL views
Telemetry backend infrastructure: telemetry_events table with RLS
(insert-only), installations table for retention tracking,
update_checks for install pings. Edge functions for update-check
(version + ping), telemetry-ingest (batch insert), and
community-pulse (weekly active count). SQL views for crash
clustering and skill co-occurrence sequences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add telemetry-sync, community-dashboard, and integration tests
gstack-telemetry-sync: fire-and-forget JSONL → Supabase sync with
privacy tier field stripping, batch limits, and cursor tracking.
gstack-community-dashboard: CLI tool querying Supabase for skill
popularity, crash clusters, and version distribution.
19 integration tests covering all telemetry scripts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: session-specific .pending markers + crash_clusters view fix
Addresses Codex review findings:
- .pending race condition: use .pending-$SESSION_ID instead of
shared .pending file to prevent concurrent session interference
- crash_clusters view: add total_occurrences and anonymous_occurrences
columns since anonymous tier has no installation_id
- Added test: own session pending marker is not finalized
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: dual-attempt update check with Supabase install ping
Fires a parallel background curl to Supabase during the slow-path
version fetch. Logs upgrade_prompted event only on fresh fetches
(not cached replays) to avoid overcounting. GitHub remains the
primary version source — Supabase ping is fire-and-forget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: integrate telemetry usage stats into /retro output
Retro now reads ~/.gstack/analytics/skill-usage.jsonl and includes
gstack usage metrics (skill run counts, top skills, success rate)
in the weekly retrospective output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: move 'Skill usage telemetry' to Completed in TODOS.md
Implemented in this branch: local JSONL logging, opt-in prompt,
privacy tiers, Supabase backend, community dashboard, /retro
integration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: wire Supabase credentials and expose tables via Data API
Add supabase/config.sh with project URL and publishable key (safe to
commit — RLS restricts to INSERT only). Update telemetry-sync,
community-dashboard, and update-check to source the config and
include proper auth headers for the Supabase REST API.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add SELECT RLS policies to migration for community dashboard reads
All telemetry data is anonymous (no PII), so public reads via the
publishable key are safe. Needed for the community dashboard to
query skill popularity, crash clusters, and version distribution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.8.6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: analytics backward-compatible with old JSONL format
Handle old-format events (no event_type field) alongside new format.
Skip hook_fire events. Fix grep -c whitespace issues and unbound
variable errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: map JSONL field names to Postgres columns in telemetry-sync
Local JSONL uses short names (v, ts, sessions) but the Supabase
table expects full names (schema_version, event_timestamp,
concurrent_sessions). Add sed mapping during field stripping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address Codex adversarial findings — cursor, opt-out, queries
- Sync cursor now advances on HTTP 2xx (not grep for "inserted")
- Update-check respects telemetry opt-out before pinging Supabase
- Dashboard queries use correct view column names (total_occurrences)
- Sync strips old-format "repo" field to prevent privacy leak
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add Privacy & Telemetry section to README
Transparent disclosure of what telemetry collects, what it never sends,
how to opt out, and a link to the schema so users can verify.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: atomic review log helpers + platform-agnostic templates (v0.8.5) (#209)
* fix: add gstack-review-log and gstack-review-read atomic helpers
Branch names with `/` break review log filepaths when Claude Code runs
multi-line bash blocks as separate shell invocations. These two scripts
encapsulate the full operation in a single command.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: replace multi-line eval+mkdir+echo blocks with atomic helpers
- Review log writes now use gstack-review-log (single command)
- Review dashboard reads now use gstack-review-read (single command)
- Remaining source+mkdir blocks use && chaining for variable persistence
- Regenerated all SKILL.md files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove Rails-isms — platform-agnostic templates and checklist
- review/checklist.md: multi-framework examples (Rails/Node/Python/Django)
- plan-ceo-review: framework-agnostic grep + generic error table
- plan-eng-review: "corresponding test" not "JS or Rails test"
- CLAUDE.md: Platform-agnostic design principle + Testing section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: update tests for gstack-review-log/read helpers
- codex review log test: check for gstack-review-log instead of reviews.jsonl
- dashboard resolver tests: check for gstack-review instead of reviews.jsonl
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.8.5)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: security hardening + issue triage (v0.8.3) (#205)
* fix: check for bun before running setup (#147)
Users without bun installed got a cryptic "command not found" error.
Now prints a clear message with install instructions.
Closes #147
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: block SSRF via URL validation in browse commands (#17)
Adds validateNavigationUrl() that blocks non-HTTP(S) schemes (file://,
javascript:, data:) and cloud metadata endpoints (169.254.169.254,
metadata.google.internal). Applied to goto, diff, and newTab commands.
Localhost and private IPs remain allowed for local dev QA.
Closes #17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: replace eval $(gstack-slug) with source <(...) (#133)
Eliminates unnecessary use of eval across all skill templates and
generated files. source <(...) has identical behavior without the
shell injection surface. Also hardens gstack-diff-scope usage.
Closes #133
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: rename /debug to /investigate to avoid Claude Code conflict (#190)
Claude Code has a built-in /debug command that shadows the gstack skill.
Renaming to /investigate which better reflects the systematic root-cause
investigation methodology.
Closes #190
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add unit tests for path validation helpers
validateOutputPath() and validateReadPath() are security-critical
functions with zero test coverage. Adds 14 tests covering safe paths,
traversal attacks, and prefix collision edge cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.8.3)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update /debug → /investigate references in docs
CLAUDE.md, README.md, and docs/skills.md still referenced the old
/debug skill name after the rename.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: harden URL validation against hostname bypasses (Codex P1)
Codex review found that metadata IPs could be reached via hex
(0xA9FEA9FE), decimal (2852039166), octal, trailing dot, and IPv6
bracket forms. Now normalizes hostnames before checking the blocklist
and probes numeric IP representations via URL constructor.
Also moves URL validation before page allocation in newTab() to
prevent zombie tabs on rejection (Codex P3).
5 new test cases for bypass variants.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: review chaining + commit hash staleness tracking (v0.8.3) (#206)
* feat: review chaining + commit hash staleness tracking
Each plan review skill now suggests the next review via AskUserQuestion:
- CEO review → eng review (required gate) + design review (if UI scope)
- Design review → eng review + CEO review (if product gaps)
- Eng review → design review (if UI changes) + CEO review (soft suggestion)
Reviews now track HEAD commit hash in JSONL entries for deterministic
staleness detection. Dashboard compares stored hash against current HEAD
and reports drift. Respects skip_eng_review config in chaining logic.
Also adds commit tracking to design-review-lite entries.
* chore: regenerate SKILL.md files for review chaining
* chore: bump version and changelog (v0.8.3)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: /codex skill — multi-AI second opinion + proactive suggestions (#197)
* feat: /codex skill — multi-AI second opinion (review, challenge, consult)
Three modes: code review with pass/fail gate, adversarial challenge mode,
and conversational consult with session continuity. First multi-AI skill
in gstack, wrapping OpenAI's Codex CLI.
* feat: integrate /codex into /review, /ship, /plan-eng-review + dashboard
/review offers Codex second opinion after completing its own review.
/ship offers Codex review as optional gate before pushing.
/plan-eng-review offers Codex plan critique after scope challenge.
Review Readiness Dashboard shows Codex Review as optional row.
* chore: bump version and changelog (v0.8.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test: codex skill validation (12 stub tests) + E2E eval test
Stub tests (free tier): verify template content — three modes, gate verdict,
session continuity, cost tracking, cross-model comparison, binary discovery,
error handling, mktemp usage, and integrations into /review, /ship, /plan-eng-review.
E2E test (paid tier): runs /codex review on vulnerable fixture repo via
session-runner, verifies output contains findings and GATE verdict.
* fix: codex auth error message — use codex login, not OPENAI_API_KEY
Codex authenticates via ChatGPT OAuth (codex login), not an env var.
* feat: codex uses high reasoning effort by default
gpt-5.2-codex is the only model available with ChatGPT login.
All commands now use model_reasoning_effort="high" for maximum
depth — the whole point is a thorough second opinion.
* feat: crank codex reasoning to xhigh (maximum)
* feat: per-mode reasoning (high for review/consult, xhigh for challenge) + web search
Review and consult use high reasoning — thorough but not slow.
Challenge (adversarial) uses xhigh — maximum depth for breaking code.
All modes enable web_search_cached so Codex can look up docs/APIs.
* refactor: don't hardcode model — use codex default (always latest)
* feat: JSONL output for codex challenge + consult modes
Use --json flag to parse codex's JSONL events, extracting reasoning
traces ([codex thinking]), tool calls ([codex ran]), and token counts.
This gives richer output than the -o flag alone — you can see what
codex thought through before its answer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: only persist codex-review log when code review actually ran
Don't write a codex-review entry to reviews.jsonl when only the
adversarial challenge (option B) was selected — there's no gate
verdict to record, and a false entry misleads the Review Readiness
Dashboard into thinking a code review happened.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add codex plan review option to /plan-eng-review
After scope challenge (Step 0), offer to have Codex independently
review the plan with a brutally honest tech reviewer persona.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test: update e2e test for codex skill
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: codex integration bugs — plan content, review persistence, quoting, stderr
- plan-eng-review: Codex now reads the plan file itself instead of inlining
content as a CLI arg (avoids ARG_MAX for large plans)
- review: add missing echo to persist codex-review results to reviews.jsonl
- codex: consult mode uses $TMPERR (mktemp) instead of hardcoded stderr path
- codex + review: quote $SLUG/$BRANCH_SLUG in review log paths
- codex: scope plan lookup to current project, warn on cross-project fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add .context/ to .gitignore to prevent session ID leaks
Codex consult mode stores session IDs in .context/codex-session-id.
Without this ignore rule, session IDs could leak into commits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: proactive skill suggestions + opt-out + trigger phrase tests
- Preamble reads proactive config via gstack-config
- Root SKILL.md.tmpl has lifecycle map (stage → skill suggestion)
- Users can opt out ("stop suggesting") / opt in ("be proactive again")
- Restored trigger phrase validation tests (16 skills × "Use when" check)
- Added missing "Use when" trigger phrases to /debug and /office-hours
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: update changelog for v0.8.0 — add proactive suggestions note
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: safety hook skills + skill usage telemetry (v0.7.1) (#189)
* feat: add /careful, /freeze, /guard, /unfreeze safety hook skills
Four new on-demand skills using Claude Code's PreToolUse hooks:
- /careful: warns before destructive commands (rm -rf, DROP TABLE, force-push, etc.)
- /freeze: blocks file edits outside a specified directory
- /guard: composes both into one command
- /unfreeze: clears freeze boundary without ending session
Pure bash hook scripts with Python fallback for JSON edge cases.
Safe exceptions for build artifacts (node_modules, dist, .next, etc.).
Hook fire telemetry logs pattern name only (never command content).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add skill usage telemetry to preamble
TemplateContext system passes skill name through resolver pipeline so
each generated SKILL.md gets its own name baked into the telemetry line.
Appends to ~/.gstack/analytics/skill-usage.jsonl on every invocation.
Covers 14 preamble-using skills + 4 hook skills (inline telemetry).
JSONL format: {"skill":"ship","ts":"...","repo":"my-project"}
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add analytics CLI for skill usage stats
bun run analytics reads ~/.gstack/analytics/skill-usage.jsonl and shows
top skills, per-repo breakdown, hook fire stats, and daily timeline.
Supports --period 7d/30d/all. Handles missing/empty/malformed data.
22 unit tests cover parsing, filtering, formatting, and edge cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add skills-used-this-week to /retro
Retro Step 2 now reads skill-usage.jsonl and shows which gstack skills
were used during the retro window. Follows the same pattern as the
Greptile signal and Backlog Health metrics — read file, filter by date,
aggregate, present. Skips silently if no analytics data exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add hook script and telemetry tests
32 unit tests for check-careful.sh covering all 8 destructive patterns,
safe exceptions, Python fallback, and malformed input handling.
7 unit tests for check-freeze.sh covering boundary enforcement,
trailing slash edge case, and missing state file.
Telemetry tests verify per-skill name correctness in generated output.
Adds careful/freeze/guard/unfreeze/document-release to ALL_SKILLS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version to 0.6.5 + changelog + mark TODOs shipped
Safety hook skills and skill usage telemetry shipped.
Analytics CLI and /retro integration included.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /debug auto-freezes edits to the module being debugged
Add PreToolUse hooks (Edit/Write) to debug/SKILL.md.tmpl that reference
the existing freeze/bin/check-freeze.sh. After Phase 1 investigation,
/debug locks edits to the narrowest affected directory.
Graceful degradation: if freeze script is unavailable, scope lock is
skipped. Users can run /unfreeze to remove the restriction.
Deferred 6 enhancements to TODOS.md, gated on telemetry showing the
freeze hook actually fires in real debugging sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: natural language skill routing + proactive suggestions (v0.7.1) (#195)
* feat: add trigger phrases to /debug and /office-hours
These two skills had zero "Use when asked to..." phrases, making them
completely invisible to natural language. Users saying "debug this" or
"brainstorm an idea" would get no skill invocation.
* feat: add proactive triggers to all workflow skills
Every skill now has "Proactively suggest when..." language so Claude
surfaces skills at natural moments — not just when the user says
specific trigger phrases.
* feat: lifecycle map + proactive preference system
Root gstack description now includes a developer workflow guide mapping
12 stages to skills. Preamble reads proactive preference via gstack-config.
Users can opt out with "stop suggesting things" and re-enable with
"be proactive again" — natural language toggle, no CLI needed.
* test: 11 journey-stage E2E routing tests + trigger phrase validation
Each test simulates a real development stage (ideation, plan review,
debug, QA, ship, retro...) with realistic project context and verifies
the right skill fires from natural language alone. 11/11 pass.
* chore: bump version and changelog (v0.7.1)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: founder discovery engine + /debug skill — v0.7.0 (#185)
* feat: add escalation protocol to preamble — all skills get DONE/BLOCKED/NEEDS_CONTEXT
Every skill now reports completion status (DONE, DONE_WITH_CONCERNS, BLOCKED,
NEEDS_CONTEXT) and has escalation rules: 3 failed attempts → STOP, security
uncertainty → STOP, scope exceeds verification → STOP.
"It is always OK to stop and say 'this is too hard for me.'"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add verification gate to /ship (Step 6.5) — no push without fresh evidence
Before pushing, re-verify tests if code changed during review fixes.
Rationalization prevention: "Should work now" → RUN IT.
"I'm confident" → Confidence is not evidence.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add scope drift detection + verification of claims to /review
Step 1.5: Before reviewing code quality, check if the diff matches stated
intent. Flags scope creep and missing requirements (INFORMATIONAL).
Step 5 addition: Every review claim must cite evidence — "this pattern is
safe" needs a line reference, "tests cover this" needs a test name.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: mandatory implementation alternatives + design doc lookup in /plan-ceo-review
Step 0C-bis: Every plan must consider 2-3 approaches (minimal viable vs ideal
architecture) before mode selection. RECOMMENDATION required.
Pre-Review System Audit now checks ~/.gstack/projects/ for /brainstorm design
docs (branch-filtered with fallback).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: design doc lookup in /plan-eng-review + fix branch name sanitization
Step 0 now checks ~/.gstack/projects/ for /brainstorm design docs
(branch-filtered with fallback, reads Supersedes: for revision context).
Fix: branch names with '/' (e.g. garrytan/better-process) now get
sanitized via tr '/' '-' in test plan artifact filenames.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: new /brainstorm and /debug skills
/brainstorm: Socratic design exploration before planning. Context gathering,
clarifying questions (smart-skip), related design discovery (keyword grep),
premise challenge, forced alternatives, design doc artifact with lineage
tracking (Supersedes: field). Writes to ~/.gstack/projects/$SLUG/.
/debug: Systematic root-cause debugging. Iron Law: no fixes without root
cause investigation. Pattern analysis, hypothesis testing with 3-strike
escalation, structured DEBUG REPORT output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: structural tests for new skills + escalation protocol assertions
Add brainstorm + debug to skillsWithUpdateCheck and skillsWithPreamble arrays.
Add structural tests: brainstorm (Phase 1-6, Design Doc, Supersedes, Smart-skip),
debug (Iron Law, Root Cause, Pattern Analysis, Hypothesis, DEBUG REPORT, 3-strike).
Add escalation protocol tests (DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT) for
all preamble skills.
Also: 2 new TODOs (design docs → Supabase sync, /plan-design-review skill),
update CLAUDE.md project structure with new skill directories.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: rename /brainstorm → /office-hours across references
Update CHANGELOG, CLAUDE.md, TODOS, design-consultation, plan-ceo-review,
and gen-skill-docs to reference the new office-hours skill name.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: YC Office Hours — dual-mode product diagnostic + builder brainstorm
Rewrite /office-hours with two modes:
Startup mode: six forcing questions (Demand Reality, Status Quo, Desperate
Specificity, Narrowest Wedge, Observation & Surprise, Future-Fit) that push
founders toward radical honesty about demand, users, and product decisions.
Includes smart routing by product stage, intrapreneurship adaptation, and
YC apply CTA for strong-signal founders.
Builder mode: generative brainstorming for side projects, hackathons,
learning, and open source. Enthusiastic collaborator tone, design thinking
questions, no business interrogation.
Mode is determined by an explicit question in Phase 1 — no guessing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add 14 assertions for YC Office Hours content coverage
Validates dual-mode structure (Startup/Builder), all six forcing questions,
builder brainstorming content, intrapreneurship adaptation, YC apply CTA,
and operating principles for both modes. 192 tests total, all passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update project documentation for v0.6.1
- README.md: added /office-hours and /debug to skills table, updated
skill count from 13 to 15, added both to install instructions
- docs/skills.md: added /office-hours and /debug deep dive sections
- CLAUDE.md: updated office-hours description to reflect dual-mode
- CONTRIBUTING.md: updated skill count from 13 to 15
- CHANGELOG.md: added YC Office Hours and /debug entries to 0.6.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: founder discovery engine in /office-hours (v0.7.0)
Turn /office-hours into a YC founder discovery engine. Every session now
ends with three beats: signal reflection (specific callbacks to what the
user said), "One more thing." transition, and a personal plea from Garry
Tan with three tiers based on founder signal strength. Top tier uses
AskUserQuestion to ask directly and opens ycombinator.com/apply?ref=gstack.
Adds Phase 4.5 (Founder Signal Synthesis), "What I noticed about how you
think" section to both design doc templates, anti-slop GOOD/BAD examples,
and emotional targets per tier.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add validation assertions for founder discovery engine
8 new assertions covering: YC apply CTA with ref=gstack tracking,
"What I noticed" design doc section, golden age framing, Garry Tan
personal plea, founder signal synthesis phase, three-tier decision
rubric, anti-slop GOOD/BAD examples, "One more thing" transition beat.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update project documentation for v0.7.0
VERSION: 0.6.4.1 → 0.7.0
CHANGELOG: new entry — Office Hours Gets Personal
README: updated /office-hours and /plan-design-review descriptions
docs/skills.md: updated /office-hours table + deep dive section
TODOS.md: added /yc-prep skill TODO (P2)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove duplicate Install section, fix stale skills lists, deduplicate CHANGELOG entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add trigger phrases to skill descriptions for better model matching (v0.6.4.1) (#169)
* feat: add trigger phrases to skill descriptions for better model matching
Anthropic's skill best practices: "the description field is not a summary —
it's when to trigger." Add explicit "Use when asked to..." phrases to 12 skill
descriptions so Claude's auto-discovery works with natural language requests
like "deploy this" or "check my diff", not just explicit /slash-commands.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add on-demand hooks and telemetry to TODOS.md
Captures two ideas from Anthropic's skill best practices post:
- /careful, /freeze, /guard on-demand hook skills (P3)
- Skill usage telemetry via preamble JSONL append (P3)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.4.1)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: exclude internal details from CHANGELOG style guide
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: interactive /plan-design-review + CEO invokes designer + 100% coverage (v0.6.4) (#149)
* refactor: rename qa-design-review → design-review
The "qa-" prefix was confusing — this is the live-site design audit with
fix loop, not a QA-only report. Rename directory and update all references
across docs, tests, scripts, and skill templates.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: interactive /plan-design-review + CEO invokes designer
Rewrite /plan-design-review from report-only grading to an interactive
plan-fixer that rates each design dimension 0-10, explains what a 10
looks like, and edits the plan to get there. Parallel structure with
/plan-ceo-review and /plan-eng-review — one issue = one AskUserQuestion.
CEO review now detects UI scope and invokes the designer perspective
when the plan has frontend/UX work, so you get design review
automatically when it matters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: validation + touchfile entries for 100% coverage
Add design-consultation to command/snapshot flag validation. Add 4
skills to contributor mode validation (plan-design-review,
design-review, design-consultation, document-release). Add 2 templates
to hardcoded branch check. Register touchfile entries for 10 new
LLM-judge tests and 1 new E2E test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: LLM-judge for 10 skills + gstack-upgrade E2E
Add LLM-judge quality evals for all uncovered skills using a DRY
runWorkflowJudge helper with section marker guards. Add real E2E
test for gstack-upgrade using mock git remote (replaces test.todo).
Add plan-edit assertion to plan-design-review E2E.
14/15 skills now at full coverage. setup-browser-cookies remains
deferred (needs real browser).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add bisect commit style to CLAUDE.md
All commits should be single logical changes, split before pushing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.6.4.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: design review lite in /review and /ship + gstack-diff-scope (v0.6.3) (#142)
* feat: gstack-diff-scope helper + design review checklist
bin/gstack-diff-scope categorizes branch changes into SCOPE_FRONTEND,
SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG.
review/design-checklist.md is a 20-item code-level checklist with
HIGH/MEDIUM/LOW confidence tags for detecting design anti-patterns
from source code.
* feat: integrate design review lite into /review and /ship
Add generateDesignReviewLite() resolver, insert {{DESIGN_REVIEW_LITE}}
partial in review Step 4.5 and ship Step 3.5. Update dashboard to
recognize design-review-lite entries. Ship pre-flight uses
gstack-diff-scope for smarter design review recommendations.
* test: E2E eval for design review lite detection
Planted CSS/HTML fixtures with 7 design anti-patterns. E2E test
verifies /review catches >= 4 of 7 (Papyrus font, 14px body text,
outline:none, !important, purple gradient, generic hero copy,
3-column feature grid).
* chore: bump version and changelog (v0.6.3.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
feat: Completeness Principle — Boil the Lake (v0.6.1) (#140)
* feat: Completeness Principle — Boil the Lake (WIP, pre-merge)
Add Completeness Principle to all skill preambles, dual-time estimates,
compression table, anti-pattern gallery, Lake Score, and completeness
gaps review category. VERSION/CHANGELOG will be rebased after merge.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: update stale version reference in TODOS.md (v0.5.3 → v0.6.1)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update CHANGELOG date + README for v0.6.1 features
- Add date to CHANGELOG 0.6.1 entry
- Add Completeness Principle to README intro
- Add SELECTIVE EXPANSION mode to CEO review section
- Add test bootstrap mention to /ship section
- Fix uninstall command missing design-consultation in project uninstall
- Add "recommends shortcuts" and "no tests" to Without gstack list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: split README into lean intro + docs/ directory (gh CLI pattern)
README: 875 → 243 lines. Keeps intro, skill table, demo, install, and
troubleshooting. All per-skill deep dives, Greptile integration guide,
and contributor mode docs moved to docs/ directory.
- docs/skills.md — full philosophy and examples for all 13 skills
- docs/greptile.md — Greptile setup and triage workflow
- docs/contributor-mode.md — how to enable and use contributor mode
- README now links to docs/ via Documentation table
- Updated skill table entries with latest features (fix-first, regression
tests, test health, completeness gaps)
- Updated demo transcript with AUTO-FIXED, coverage audit, regression test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove "competitor" language, rewrite README in Garry's voice
Replace "browses competitors" with "knows the landscape" / "what's out
there" throughout all user-facing copy. Trim README from 243 to 167
lines — tighter, more opinionated, less listicle energy. Remove
Completeness Principle from README top (it lives in CLAUDE.md and the
skill preambles where Claude actually reads it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rewrite README in Garry's raw voice — AGI era, L8 factory, real stories
The README now sounds like Garry, not a product page. Leads with the
live experiment, the 16k LOC/day reality, the real-life coding stories
(Austin, hospital bedside). Highlights the newest unlocks (design at
the heart, /qa parallelism, smart review routing, test bootstrap).
Closes with an open invitation — free MIT, fork it, let's all ride
the wave together.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add Garry's bonafides to README intro — Palantir, Posterous, YC, 600k LOC
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add real /retro numbers — 140k lines, 362 commits across 3 projects
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add "in the last 60 days" timeframe to 600k LOC claim
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add GitHub contribution graphs — 2026 vs 2013 side by side
Same person, different era. 2013: 772 contributions building Bookface.
2026: 1,237 contributions and accelerating. The difference is the tooling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: clarify /retro stats are from last 7 days
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add designer/PM/eng manager roles to intro
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove Josh/L8 reference from README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: move demo up, make it dramatically more impressive
Show the actual architecture diagram, auto-fixed issues, 100% coverage,
regression test generation. Punch line: "That is not a copilot. That is
a team."
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove "My journey" section — intro already covers it
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: prefix all skill commands with You: in demo transcript
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: collapse You/Claude lines in demo — no gap between command and response
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: clarify plan mode flow in demo — approve, exit, Claude implements
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: move /ship to end of demo — review → QA → ship is the real flow
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add /plan-design-review to demo, tighten CEO response
Shorter CEO reply, compressed eng diagram, added design audit with
AI Slop score. Seven commands now: plan → eng → build → design →
review → QA → ship.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: move design review before implementation — it's part of planning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: reorder demo — design before eng, after CEO
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove URL from /plan-design-review in demo
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add [...] annotations showing what actually happens at each step
Each step now shows what the agent does under the hood: 8 expansion
proposals cherry-picked, 80-item design audit, ASCII diagrams for
every flow, 2400 lines written in 8 minutes, real browser QA, bug
found and fixed. Makes the demo feel real, not abstract.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rename Contributor Mode to How to Contribute in docs table
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add Coinbase, Instacart, Rippling to YC bonafides
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add "one or two people in a garage" to founder story
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add skill table to top of skills.md with anchor links
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: consolidate — roll contributor-mode into CONTRIBUTING, greptile into skills
- docs/contributor-mode.md → merged into CONTRIBUTING.md (session awareness section)
- docs/greptile.md → merged into docs/skills.md (Greptile integration section)
- Reordered docs table: Skills > Architecture > Browser > Contributing > Changelog
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>