From a0328be04c37ce4a52f6fb169b355400ecbb0ef2 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Mon, 30 Mar 2026 21:45:28 -0600 Subject: [PATCH] feat: always-on adversarial review + scope drift + plan mode design tools (v0.14.3.0) (#694) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * feat: always-on adversarial review + scope drift resolver + cross-model tension format Rewrite generateAdversarialStep() to remove LOC-based tier skipping. Every review now runs both Claude adversarial subagent and Codex adversarial challenge. OLD_CFG only gates Codex passes, not Claude. Add generateScopeDrift() shared resolver. Fix cross-model tension AskUserQuestion to include RECOMMENDATION + Completeness. * feat: add scope drift to /ship, extract from /review template /ship gets {{SCOPE_DRIFT}} at Step 3.48 + PR body slot. /review replaces hardcoded scope drift with {{SCOPE_DRIFT}} + {{PLAN_COMPLETION_AUDIT_REVIEW}}. * feat: plan mode safe operations — browse, design, codex allowed in plan mode Add preamble section declaring $B, $D, codex, and ~/.gstack/ writes as plan-mode-safe. Unblocks design skills during planning. * test: update adversarial + add scope drift assertions Rename adversarial tests to reflect always-on behavior. Remove tier threshold assertions. Add scope drift content assertions for both /review and /ship generated SKILL.md files. * chore: bump version and changelog (v0.14.3.0) Co-Authored-By: Claude Opus 4.6 --------- Co-authored-by: Claude Opus 4.6 --- CHANGELOG.md | 16 ++++ SKILL.md | 15 ++++ VERSION | 2 +- autoplan/SKILL.md | 15 ++++ benchmark/SKILL.md | 15 ++++ browse/SKILL.md | 15 ++++ canary/SKILL.md | 15 ++++ codex/SKILL.md | 15 ++++ connect-chrome/SKILL.md | 15 ++++ cso/SKILL.md | 15 ++++ design-consultation/SKILL.md | 15 ++++ design-html/SKILL.md | 15 ++++ design-review/SKILL.md | 15 ++++ design-shotgun/SKILL.md | 15 ++++ document-release/SKILL.md | 15 ++++ investigate/SKILL.md | 15 ++++ land-and-deploy/SKILL.md | 15 ++++ learn/SKILL.md | 15 ++++ office-hours/SKILL.md | 15 ++++ plan-ceo-review/SKILL.md | 20 ++++- plan-design-review/SKILL.md | 17 +++- plan-eng-review/SKILL.md | 20 ++++- qa-only/SKILL.md | 15 ++++ qa/SKILL.md | 15 ++++ retro/SKILL.md | 15 ++++ review/SKILL.md | 143 +++++++++++++++++---------------- review/SKILL.md.tmpl | 35 +------- scripts/resolvers/index.ts | 3 +- scripts/resolvers/preamble.ts | 15 ++++ scripts/resolvers/review.ts | 127 ++++++++++++++++++----------- setup-browser-cookies/SKILL.md | 15 ++++ setup-deploy/SKILL.md | 15 ++++ ship/SKILL.md | 134 +++++++++++++++++++----------- ship/SKILL.md.tmpl | 6 ++ test/skill-validation.test.ts | 45 +++++++---- 35 files changed, 692 insertions(+), 221 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index cfa6a53d4efdfb9230b8f604af34998cf3023da5..356ae97edb231a0babe6f041cc9eb6e8b16a65b1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,21 @@ # Changelog +## [0.14.3.0] - 2026-03-31 — Always-On Adversarial Review + Scope Drift + Plan Mode Design Tools + +Every code review now runs adversarial analysis from both Claude and Codex, regardless of diff size. A 5-line auth change gets the same cross-model scrutiny as a 500-line feature. The old "skip adversarial for small diffs" heuristic is gone... diff size was never a good proxy for risk. + +### Added + +- **Always-on adversarial review.** Every `/review` and `/ship` run now dispatches both a Claude adversarial subagent and a Codex adversarial challenge. No more tier-based skipping. The Codex structured review (formal P1 pass/fail gate) still runs on large diffs (200+ lines) where the formal gate adds value. +- **Scope drift detection in `/ship`.** Before shipping, `/ship` now checks whether you built what you said you'd build, nothing more, nothing less. Catches scope creep ("while I was in there..." changes) and missing requirements. Results appear in the PR body. +- **Plan Mode Safe Operations.** Browse screenshots, design mockups, Codex outside voices, and writing to `~/.gstack/` are now explicitly allowed in plan mode. Design-related skills (`/design-consultation`, `/design-shotgun`, `/design-html`, `/plan-design-review`) can generate visual artifacts during planning without fighting plan mode restrictions. + +### Changed + +- **Adversarial opt-out split.** The legacy `codex_reviews=disabled` config now only gates Codex passes. Claude adversarial subagent always runs since it's free and fast. Previously the kill switch disabled everything. +- **Cross-model tension format.** Outside voice disagreements now include `RECOMMENDATION` and `Completeness` scores, matching the standard AskUserQuestion format used everywhere else in gstack. +- **Scope drift is now a shared resolver.** Extracted from `/review` into `generateScopeDrift()` so both `/review` and `/ship` use the same logic. DRY. + ## [0.14.2.0] - 2026-03-30 — Sidebar CSS Inspector + Per-Tab Agents The sidebar is now a visual design tool. Pick any element on the page and see the full CSS rule cascade, box model, and computed styles right in the Side Panel. Edit styles live and see changes instantly. Each browser tab gets its own independent agent, so you can work on multiple pages simultaneously without cross-talk. Cleanup is LLM-powered... the agent snapshots the page, understands it semantically, and removes the junk while keeping the site's identity. diff --git a/SKILL.md b/SKILL.md index 9e6377f685b2b9be6c7de8b1afd67b1be44e595c..29b3110c16dfe71c518b827a3a10e00ecec84626 100644 --- a/SKILL.md +++ b/SKILL.md @@ -286,6 +286,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/VERSION b/VERSION index 87df6b0551c302ff7c9729960d6404dafcb89dd6..574c7c4a637364d8de234b04f893841fcfbe8afe 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.14.2.0 +0.14.3.0 diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md index 2754cef0759efc54851ce18435ad331ca2d8ab2f..31ae9ab2184b0c9c8493506c6768b06e8817e4db 100644 --- a/autoplan/SKILL.md +++ b/autoplan/SKILL.md @@ -378,6 +378,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/benchmark/SKILL.md b/benchmark/SKILL.md index 8fe4bc94cdb5af1a8a52474bea7376b0f3aa8976..aa6567df30bc446f2829585dca879c7dd044ea2c 100644 --- a/benchmark/SKILL.md +++ b/benchmark/SKILL.md @@ -288,6 +288,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/browse/SKILL.md b/browse/SKILL.md index 9ad795bcd9faf9009e625c24bc24250f63b92995..464c03ae1ae382abff8b2a5503914862e19c414a 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -288,6 +288,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/canary/SKILL.md b/canary/SKILL.md index 6197b3c42a9ab2ed9464a9510ce80d555c867cda..48c6e0b904f7710cc641caff0609c38ab5a32a29 100644 --- a/canary/SKILL.md +++ b/canary/SKILL.md @@ -353,6 +353,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/codex/SKILL.md b/codex/SKILL.md index a3eb12cf3aaacba13e8a53fcf74c3b1290a5d1d4..34c1b121af760f27529b2ce8e2a9799131ecb4e1 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -372,6 +372,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/connect-chrome/SKILL.md b/connect-chrome/SKILL.md index c863b171fd96f962668c1bb1369638a355d4eabd..d1736f29f6f9b17a75ed5e47d10eacfeb77bfb9a 100644 --- a/connect-chrome/SKILL.md +++ b/connect-chrome/SKILL.md @@ -369,6 +369,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/cso/SKILL.md b/cso/SKILL.md index 3945884cffd73c437f968d8974de43fa0196e02f..ca79f2235f12199e8f913056d64ae52298757b0a 100644 --- a/cso/SKILL.md +++ b/cso/SKILL.md @@ -357,6 +357,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index d1b38a111633c4057db109627163e38d6da09b40..ff6b030ca11f8020280c372e2dd5b8f44eb8e238 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -376,6 +376,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/design-html/SKILL.md b/design-html/SKILL.md index d21750a54483e9f210f112905f922c441c895211..24183b90591a6373b71fd1d02147c5c2ea81a36d 100644 --- a/design-html/SKILL.md +++ b/design-html/SKILL.md @@ -358,6 +358,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/design-review/SKILL.md b/design-review/SKILL.md index c152835c7ec9e4daac2a1adc45b1290ab02bdfae..3be5a7c417b4491ebc2c55f614a10de84c6fb91b 100644 --- a/design-review/SKILL.md +++ b/design-review/SKILL.md @@ -376,6 +376,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/design-shotgun/SKILL.md b/design-shotgun/SKILL.md index 3d3421f9a0181bc76841df02cf0183381f33d457..5e29bfccccbef612dd44545fce3300a96056dc21 100644 --- a/design-shotgun/SKILL.md +++ b/design-shotgun/SKILL.md @@ -355,6 +355,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/document-release/SKILL.md b/document-release/SKILL.md index a1c6ede925043442b7b79fb355505513e596a72d..7001fd6cacc505a81cb28c537970e2a373893fec 100644 --- a/document-release/SKILL.md +++ b/document-release/SKILL.md @@ -355,6 +355,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/investigate/SKILL.md b/investigate/SKILL.md index ab940d17535639b1f901e0fc2357fd1041b933fc..a65025ec8d7cf6d1732a4e70147050e624fc7a6c 100644 --- a/investigate/SKILL.md +++ b/investigate/SKILL.md @@ -370,6 +370,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/land-and-deploy/SKILL.md b/land-and-deploy/SKILL.md index 6311ecfc7a4f795376334da1facc4489bc24b123..2cca312ebcc24b8709067c17e03bbbcb564c3962 100644 --- a/land-and-deploy/SKILL.md +++ b/land-and-deploy/SKILL.md @@ -370,6 +370,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/learn/SKILL.md b/learn/SKILL.md index 324b4a38781ad85ebc9acd5dafe47ece744d7b80..9d344eebd3e6790ccfb234962237c5a01ee819e5 100644 --- a/learn/SKILL.md +++ b/learn/SKILL.md @@ -355,6 +355,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index 900fb5073482fe17d911090856ccc25be133158d..9c8de4ce66cfdd11b035170d1149aeb432694c5e 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -380,6 +380,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index c7631669357915244188dd940525715502773e7c..48a8ab4097986f23bf19f6e7a13a94baaced9ca5 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -376,6 +376,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: @@ -1295,6 +1310,9 @@ For each substantive tension point, use AskUserQuestion: > "Cross-model disagreement on [topic]. The review found [X] but the outside voice > argues [Y]. [One sentence on what context you might be missing.]" +> +> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument +> is more compelling and why]. Completeness: A=X/10, B=Y/10. Options: - A) Accept the outside voice's recommendation (I'll apply this change) @@ -1504,7 +1522,7 @@ Display: - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed. - **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index a8c7f26410244c0a67d0b3f171d1c05addf55697..3c973b108b361ec38716f443a2a41844336d6048 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -374,6 +374,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: @@ -1203,7 +1218,7 @@ Display: - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed. - **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 1dad9fc0b95ee377d00827413e8c5f9cf540211d..d2715aac09bba138b3923b409832cb1bbd08012a 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -375,6 +375,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: @@ -963,6 +978,9 @@ For each substantive tension point, use AskUserQuestion: > "Cross-model disagreement on [topic]. The review found [X] but the outside voice > argues [Y]. [One sentence on what context you might be missing.]" +> +> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument +> is more compelling and why]. Completeness: A=X/10, B=Y/10. Options: - A) Accept the outside voice's recommendation (I'll apply this change) @@ -1149,7 +1167,7 @@ Display: - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed. - **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index 2beb599a57c63e10783555351339db9db1d931ad..63c970ad63a27c819ad1ab04d91900f31e162e14 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -371,6 +371,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/qa/SKILL.md b/qa/SKILL.md index 94081f20e77a807e466f6e42f6052ec32bdf5a48..e2a0322632dc42ada1132fa4284d0fc5e07f25ec 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -377,6 +377,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/retro/SKILL.md b/retro/SKILL.md index 5a84039b140189f0e1893cb3d301ecc410df522e..52af68daf460dc0fed5d46671ce1306741b59337 100644 --- a/retro/SKILL.md +++ b/retro/SKILL.md @@ -353,6 +353,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/review/SKILL.md b/review/SKILL.md index 3f492d21fffaf4c53c59a82af12cc2182612f462..3216c7475b711a3170d90c1c197641dbe5087274 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -374,6 +374,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: @@ -473,6 +488,31 @@ Before reviewing code quality, check: **did they build what was requested — no 2. Identify the **stated intent** — what was this branch supposed to accomplish? 3. Run `git diff origin/...HEAD --stat` and compare the files changed against the stated intent. +4. Evaluate with skepticism (incorporating plan completion results if available from an earlier step or adjacent section): + + **SCOPE CREEP detection:** + - Files changed that are unrelated to the stated intent + - New features or refactors not mentioned in the plan + - "While I was in there..." changes that expand blast radius + + **MISSING REQUIREMENTS detection:** + - Requirements from TODOS.md/PR description not addressed in the diff + - Test coverage gaps for stated requirements + - Partial implementations (started but not finished) + +5. Output (before the main review begins): + \`\`\` + Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING] + Intent: <1-line summary of what was requested> + Delivered: <1-line summary of what the diff actually does> + [If drift: list each out-of-scope change] + [If missing: list each unaddressed requirement] + \`\`\` + +6. This is **INFORMATIONAL** — does not block the review. Proceed to the next step. + +--- + ### Plan File Discovery 1. **Conversation context (primary):** Check if there is an active plan file in this conversation. The host agent's system messages include plan file paths when in plan mode. If found, use it directly — this is the most reliable signal. @@ -591,31 +631,6 @@ Plan items: N DONE, M PARTIAL, K NOT DONE **No plan file found:** Fall back to existing scope drift behavior (check TODOS.md and PR description only). -4. Evaluate with skepticism (incorporating plan completion results if available): - - **SCOPE CREEP detection:** - - Files changed that are unrelated to the stated intent - - New features or refactors not mentioned in the plan - - "While I was in there..." changes that expand blast radius - - **MISSING REQUIREMENTS detection:** - - Requirements from TODOS.md/PR description not addressed in the diff - - Test coverage gaps for stated requirements - - Partial implementations (started but not finished) - -5. Output (before the main review begins): - ``` - Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING] - Intent: <1-line summary of what was requested> - Delivered: <1-line summary of what the diff actually does> - [If drift: list each out-of-scope change] - [If missing: list each unaddressed requirement] - ``` - -6. This is **INFORMATIONAL** — does not block the review. Proceed to Step 2. - ---- - ## Step 2: Read the checklist Read `.claude/skills/review/checklist.md`. @@ -1096,9 +1111,9 @@ If no documentation files exist, skip this step silently. --- -## Step 5.7: Adversarial review (auto-scaled) +## Step 5.7: Adversarial review (always-on) -Adversarial review thoroughness scales automatically based on diff size. No configuration needed. +Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical. **Detect diff size and tool availability:** @@ -1107,30 +1122,34 @@ DIFF_INS=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ insertion' DIFF_DEL=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0") DIFF_TOTAL=$((DIFF_INS + DIFF_DEL)) which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -# Respect old opt-out +# Legacy opt-out — only gates Codex passes, Claude always runs OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) echo "DIFF_SIZE: $DIFF_TOTAL" echo "OLD_CFG: ${OLD_CFG:-not_set}" ``` -If `OLD_CFG` is `disabled`: skip this step silently. Continue to the next step. +If `OLD_CFG` is `disabled`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section. -**User override:** If the user explicitly requested a specific tier (e.g., "run all passes", "paranoid review", "full adversarial", "do all 4 passes", "thorough review"), honor that request regardless of diff size. Jump to the matching tier section. - -**Auto-select tier based on diff size:** -- **Small (< 50 lines changed):** Skip adversarial review entirely. Print: "Small diff ($DIFF_TOTAL lines) — adversarial review skipped." Continue to the next step. -- **Medium (50–199 lines changed):** Run Codex adversarial challenge (or Claude adversarial subagent if Codex unavailable). Jump to the "Medium tier" section. -- **Large (200+ lines changed):** Run all remaining passes — Codex structured review + Claude adversarial subagent + Codex adversarial. Jump to the "Large tier" section. +**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size. --- -### Medium tier (50–199 lines) +### Claude adversarial subagent (always runs) + +Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. + +Subagent prompt: +"Read the diff for this branch with `git diff origin/`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." + +Present findings under an `ADVERSARIAL REVIEW (Claude subagent):` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. -Claude's structured review already ran. Now add a **cross-model adversarial challenge**. +If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing." -**If Codex is available:** run the Codex adversarial challenge. **If Codex is NOT available:** fall back to the Claude adversarial subagent instead. +--- -**Codex adversarial:** +### Codex adversarial challenge (always runs when available) + +If Codex is available AND `OLD_CFG` is NOT `disabled`: ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) @@ -1150,34 +1169,16 @@ Present the full output verbatim. This is informational — it never blocks ship - **Timeout:** "Codex timed out after 5 minutes." - **Empty response:** "Codex returned no response. Stderr: ." -On any Codex error, fall back to the Claude adversarial subagent automatically. - -**Claude adversarial subagent** (fallback when Codex unavailable or errored): +**Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing. -Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. - -Subagent prompt: -"Read the diff for this branch with `git diff origin/`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." - -Present findings under an `ADVERSARIAL REVIEW (Claude subagent):` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. - -If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing without adversarial review." - -**Persist the review result:** -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"medium","commit":"'"$(git rev-parse --short HEAD)"'"}' -``` -Substitute STATUS: "clean" if no findings, "issues_found" if findings exist. SOURCE: "codex" if Codex ran, "claude" if subagent ran. If both failed, do NOT persist. - -**Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing (if Codex was used). +If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: `npm install -g @openai/codex`" --- -### Large tier (200+ lines) +### Codex structured review (large diffs only, 200+ lines) -Claude's structured review already ran. Now run **all three remaining passes** for maximum coverage: +If `DIFF_TOTAL >= 200` AND Codex is available AND `OLD_CFG` is NOT `disabled`: -**1. Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } @@ -1198,34 +1199,34 @@ B) Continue — review will still complete If A: address the findings. Re-run `codex review` to verify. -Read stderr for errors (same error handling as medium tier). +Read stderr for errors (same error handling as Codex adversarial above). After stderr: `rm -f "$TMPERR"` -**2. Claude adversarial subagent:** Dispatch a subagent with the adversarial prompt (same prompt as medium tier). This always runs regardless of Codex availability. +If `DIFF_TOTAL < 200`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs. -**3. Codex adversarial challenge (if available):** Run `codex exec` with the adversarial prompt (same as medium tier). +--- -If Codex is not available for steps 1 and 3, note to the user: "Codex CLI not found — large-diff review ran Claude structured + Claude adversarial (2 of 4 passes). Install Codex for full 4-pass coverage: `npm install -g @openai/codex`" +### Persist the review result -**Persist the review result AFTER all passes complete** (not after each sub-step): +After all passes complete, persist: ```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"large","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' ``` -Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), or "informational" if Codex was unavailable. If all passes failed, do NOT persist. +Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist. --- -### Cross-model synthesis (medium and large tiers) +### Cross-model synthesis After all passes complete, synthesize findings across all sources: ``` -ADVERSARIAL REVIEW SYNTHESIS (auto: TIER, N lines): +ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines): ════════════════════════════════════════════════════════════ High confidence (found by multiple sources): [findings agreed on by >1 pass] Unique to Claude structured review: [from earlier step] - Unique to Claude adversarial: [from subagent, if ran] + Unique to Claude adversarial: [from subagent] Unique to Codex: [from codex adversarial or code review, if ran] Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗ ════════════════════════════════════════════════════════════ diff --git a/review/SKILL.md.tmpl b/review/SKILL.md.tmpl index b748483a09a16d947806ee430c9a56f4778ead81..7fb881d67873bc47963ce25ca6574c5d86c93b70 100644 --- a/review/SKILL.md.tmpl +++ b/review/SKILL.md.tmpl @@ -37,43 +37,10 @@ You are running the `/review` workflow. Analyze the current branch's diff agains --- -## Step 1.5: Scope Drift Detection - -Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?** - -1. Read `TODOS.md` (if it exists). Read PR description (`gh pr view --json body --jq .body 2>/dev/null || true`). - Read commit messages (`git log origin/..HEAD --oneline`). - **If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR. -2. Identify the **stated intent** — what was this branch supposed to accomplish? -3. Run `git diff origin/...HEAD --stat` and compare the files changed against the stated intent. +{{SCOPE_DRIFT}} {{PLAN_COMPLETION_AUDIT_REVIEW}} -4. Evaluate with skepticism (incorporating plan completion results if available): - - **SCOPE CREEP detection:** - - Files changed that are unrelated to the stated intent - - New features or refactors not mentioned in the plan - - "While I was in there..." changes that expand blast radius - - **MISSING REQUIREMENTS detection:** - - Requirements from TODOS.md/PR description not addressed in the diff - - Test coverage gaps for stated requirements - - Partial implementations (started but not finished) - -5. Output (before the main review begins): - ``` - Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING] - Intent: <1-line summary of what was requested> - Delivered: <1-line summary of what the diff actually does> - [If drift: list each out-of-scope change] - [If missing: list each unaddressed requirement] - ``` - -6. This is **INFORMATIONAL** — does not block the review. Proceed to Step 2. - ---- - ## Step 2: Read the checklist Read `.claude/skills/review/checklist.md`. diff --git a/scripts/resolvers/index.ts b/scripts/resolvers/index.ts index 7ac7f1a25f5ba544c2c27c057e9bb6d90b1ed4e1..297b89c973818e45ecd5dd1415a02a1569ea0c66 100644 --- a/scripts/resolvers/index.ts +++ b/scripts/resolvers/index.ts @@ -11,7 +11,7 @@ import { generateTestFailureTriage } from './preamble'; import { generateCommandReference, generateSnapshotFlags, generateBrowseSetup } from './browse'; import { generateDesignMethodology, generateDesignHardRules, generateDesignOutsideVoices, generateDesignReviewLite, generateDesignSketch, generateDesignSetup, generateDesignMockup, generateDesignShotgunLoop } from './design'; import { generateTestBootstrap, generateTestCoverageAuditPlan, generateTestCoverageAuditShip, generateTestCoverageAuditReview } from './testing'; -import { generateReviewDashboard, generatePlanFileReviewReport, generateSpecReviewLoop, generateBenefitsFrom, generateCodexSecondOpinion, generateAdversarialStep, generateCodexPlanReview, generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec } from './review'; +import { generateReviewDashboard, generatePlanFileReviewReport, generateSpecReviewLoop, generateBenefitsFrom, generateCodexSecondOpinion, generateAdversarialStep, generateCodexPlanReview, generatePlanCompletionAuditShip, generatePlanCompletionAuditReview, generatePlanVerificationExec, generateScopeDrift } from './review'; import { generateSlugEval, generateSlugSetup, generateBaseBranchDetect, generateDeployBootstrap, generateQAMethodology, generateCoAuthorTrailer, generateChangelogWorkflow } from './utility'; import { generateLearningsSearch, generateLearningsLog } from './learnings'; import { generateConfidenceCalibration } from './confidence'; @@ -45,6 +45,7 @@ export const RESOLVERS: Record = { BENEFITS_FROM: generateBenefitsFrom, CODEX_SECOND_OPINION: generateCodexSecondOpinion, ADVERSARIAL_STEP: generateAdversarialStep, + SCOPE_DRIFT: generateScopeDrift, DEPLOY_BOOTSTRAP: generateDeployBootstrap, CODEX_PLAN_REVIEW: generateCodexPlanReview, PLAN_COMPLETION_AUDIT_SHIP: generatePlanCompletionAuditShip, diff --git a/scripts/resolvers/preamble.ts b/scripts/resolvers/preamble.ts index 8cd1b5572b4055237aed640a47006fce0d6af95e..7b144b2f8bd95e5bd3d0517c90d36455f14b9071 100644 --- a/scripts/resolvers/preamble.ts +++ b/scripts/resolvers/preamble.ts @@ -460,6 +460,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- \`$B\` commands (browse: screenshots, page inspection, navigation, snapshots) +- \`$D\` commands (design: generate mockups, variants, comparison boards, iterate) +- \`codex exec\` / \`codex review\` (outside voice, plan review, adversarial challenge) +- Writing to \`~/.gstack/\` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- \`open\` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts index 5db226444885ff927075dcc6a7e18af9cc7b8583..65f99fb2599565377c5bca6defc27caaea49d64e 100644 --- a/scripts/resolvers/review.ts +++ b/scripts/resolvers/review.ts @@ -54,7 +54,7 @@ Display: - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed. - **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** @@ -359,6 +359,50 @@ SECOND OPINION (Claude subagent): If A: revise the premise and note the revision. If B: proceed (and note that the user defended this premise with reasoning — this is a founder signal if they articulate WHY they disagree, not just dismiss).`; } +// ─── Scope Drift Detection (shared between /review and /ship) ──────── + +export function generateScopeDrift(ctx: TemplateContext): string { + const isShip = ctx.skillName === 'ship'; + const stepNum = isShip ? '3.48' : '1.5'; + + return `## Step ${stepNum}: Scope Drift Detection + +Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?** + +1. Read \`TODOS.md\` (if it exists). Read PR description (\`gh pr view --json body --jq .body 2>/dev/null || true\`). + Read commit messages (\`git log origin/..HEAD --oneline\`). + **If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR. +2. Identify the **stated intent** — what was this branch supposed to accomplish? +3. Run \`git diff origin/...HEAD --stat\` and compare the files changed against the stated intent. + +4. Evaluate with skepticism (incorporating plan completion results if available from an earlier step or adjacent section): + + **SCOPE CREEP detection:** + - Files changed that are unrelated to the stated intent + - New features or refactors not mentioned in the plan + - "While I was in there..." changes that expand blast radius + + **MISSING REQUIREMENTS detection:** + - Requirements from TODOS.md/PR description not addressed in the diff + - Test coverage gaps for stated requirements + - Partial implementations (started but not finished) + +5. Output (before the main review begins): + \\\`\\\`\\\` + Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING] + Intent: <1-line summary of what was requested> + Delivered: <1-line summary of what the diff actually does> + [If drift: list each out-of-scope change] + [If missing: list each unaddressed requirement] + \\\`\\\`\\\` + +6. This is **INFORMATIONAL** — does not block the review. Proceed to the next step. + +---`; +} + +// ─── Adversarial Review (always-on) ────────────────────────────────── + export function generateAdversarialStep(ctx: TemplateContext): string { // Codex host: strip entirely — Codex should never invoke itself if (ctx.host === 'codex') return ''; @@ -366,9 +410,9 @@ export function generateAdversarialStep(ctx: TemplateContext): string { const isShip = ctx.skillName === 'ship'; const stepNum = isShip ? '3.8' : '5.7'; - return `## Step ${stepNum}: Adversarial review (auto-scaled) + return `## Step ${stepNum}: Adversarial review (always-on) -Adversarial review thoroughness scales automatically based on diff size. No configuration needed. +Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical. **Detect diff size and tool availability:** @@ -377,30 +421,34 @@ DIFF_INS=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ insertion' DIFF_DEL=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0") DIFF_TOTAL=$((DIFF_INS + DIFF_DEL)) which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -# Respect old opt-out +# Legacy opt-out — only gates Codex passes, Claude always runs OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) echo "DIFF_SIZE: $DIFF_TOTAL" echo "OLD_CFG: \${OLD_CFG:-not_set}" \`\`\` -If \`OLD_CFG\` is \`disabled\`: skip this step silently. Continue to the next step. +If \`OLD_CFG\` is \`disabled\`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section. -**User override:** If the user explicitly requested a specific tier (e.g., "run all passes", "paranoid review", "full adversarial", "do all 4 passes", "thorough review"), honor that request regardless of diff size. Jump to the matching tier section. - -**Auto-select tier based on diff size:** -- **Small (< 50 lines changed):** Skip adversarial review entirely. Print: "Small diff ($DIFF_TOTAL lines) — adversarial review skipped." Continue to the next step. -- **Medium (50–199 lines changed):** Run Codex adversarial challenge (or Claude adversarial subagent if Codex unavailable). Jump to the "Medium tier" section. -- **Large (200+ lines changed):** Run all remaining passes — Codex structured review + Claude adversarial subagent + Codex adversarial. Jump to the "Large tier" section. +**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size. --- -### Medium tier (50–199 lines) +### Claude adversarial subagent (always runs) -Claude's structured review already ran. Now add a **cross-model adversarial challenge**. +Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. -**If Codex is available:** run the Codex adversarial challenge. **If Codex is NOT available:** fall back to the Claude adversarial subagent instead. +Subagent prompt: +"Read the diff for this branch with \`git diff origin/\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." -**Codex adversarial:** +Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. + +If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing." + +--- + +### Codex adversarial challenge (always runs when available) + +If Codex is available AND \`OLD_CFG\` is NOT \`disabled\`: \`\`\`bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) @@ -420,34 +468,16 @@ Present the full output verbatim. This is informational — it never blocks ship - **Timeout:** "Codex timed out after 5 minutes." - **Empty response:** "Codex returned no response. Stderr: ." -On any Codex error, fall back to the Claude adversarial subagent automatically. +**Cleanup:** Run \`rm -f "$TMPERR_ADV"\` after processing. -**Claude adversarial subagent** (fallback when Codex unavailable or errored): - -Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. - -Subagent prompt: -"Read the diff for this branch with \`git diff origin/\`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." - -Present findings under an \`ADVERSARIAL REVIEW (Claude subagent):\` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. - -If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing without adversarial review." - -**Persist the review result:** -\`\`\`bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"medium","commit":"'"$(git rev-parse --short HEAD)"'"}' -\`\`\` -Substitute STATUS: "clean" if no findings, "issues_found" if findings exist. SOURCE: "codex" if Codex ran, "claude" if subagent ran. If both failed, do NOT persist. - -**Cleanup:** Run \`rm -f "$TMPERR_ADV"\` after processing (if Codex was used). +If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: \`npm install -g @openai/codex\`" --- -### Large tier (200+ lines) +### Codex structured review (large diffs only, 200+ lines) -Claude's structured review already ran. Now run **all three remaining passes** for maximum coverage: +If \`DIFF_TOTAL >= 200\` AND Codex is available AND \`OLD_CFG\` is NOT \`disabled\`: -**1. Codex structured review (if available):** \`\`\`bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } @@ -468,34 +498,34 @@ B) Continue — review will still complete If A: address the findings${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Re-run \`codex review\` to verify. -Read stderr for errors (same error handling as medium tier). +Read stderr for errors (same error handling as Codex adversarial above). After stderr: \`rm -f "$TMPERR"\` -**2. Claude adversarial subagent:** Dispatch a subagent with the adversarial prompt (same prompt as medium tier). This always runs regardless of Codex availability. +If \`DIFF_TOTAL < 200\`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs. -**3. Codex adversarial challenge (if available):** Run \`codex exec\` with the adversarial prompt (same as medium tier). +--- -If Codex is not available for steps 1 and 3, note to the user: "Codex CLI not found — large-diff review ran Claude structured + Claude adversarial (2 of 4 passes). Install Codex for full 4-pass coverage: \`npm install -g @openai/codex\`" +### Persist the review result -**Persist the review result AFTER all passes complete** (not after each sub-step): +After all passes complete, persist: \`\`\`bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"large","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' \`\`\` -Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), or "informational" if Codex was unavailable. If all passes failed, do NOT persist. +Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist. --- -### Cross-model synthesis (medium and large tiers) +### Cross-model synthesis After all passes complete, synthesize findings across all sources: \`\`\` -ADVERSARIAL REVIEW SYNTHESIS (auto: TIER, N lines): +ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines): ════════════════════════════════════════════════════════════ High confidence (found by multiple sources): [findings agreed on by >1 pass] Unique to Claude structured review: [from earlier step] - Unique to Claude adversarial: [from subagent, if ran] + Unique to Claude adversarial: [from subagent] Unique to Codex: [from codex adversarial or code review, if ran] Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗ ════════════════════════════════════════════════════════════ @@ -619,6 +649,9 @@ For each substantive tension point, use AskUserQuestion: > "Cross-model disagreement on [topic]. The review found [X] but the outside voice > argues [Y]. [One sentence on what context you might be missing.]" +> +> RECOMMENDATION: Choose [A or B] because [one-line reason explaining which argument +> is more compelling and why]. Completeness: A=X/10, B=Y/10. Options: - A) Accept the outside voice's recommendation (I'll apply this change) diff --git a/setup-browser-cookies/SKILL.md b/setup-browser-cookies/SKILL.md index 67657a6ba653a87bee426bf2756056f47dd1166f..9724e928e3196f4a5e490eaebb2d80d71c009ad3 100644 --- a/setup-browser-cookies/SKILL.md +++ b/setup-browser-cookies/SKILL.md @@ -285,6 +285,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/setup-deploy/SKILL.md b/setup-deploy/SKILL.md index 9abfb9754b6b42eb42400bd904d7bb2f4691b293..4d174c72a1c5ce90aa1591bd337eba5d7cb08624 100644 --- a/setup-deploy/SKILL.md +++ b/setup-deploy/SKILL.md @@ -356,6 +356,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: diff --git a/ship/SKILL.md b/ship/SKILL.md index 4519b6e2d58d6965630825f04906ed3a7c29f2bd..94257f29a6f698163e9c88d4bf33ba3d237a8dd8 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -375,6 +375,21 @@ If you cannot determine the outcome, use "unknown". Both local JSONL and remote telemetry only run if telemetry is not off. The remote binary additionally requires the binary to exist. +## Plan Mode Safe Operations + +When in plan mode, these operations are always allowed because they produce +artifacts that inform the plan, not code changes: + +- `$B` commands (browse: screenshots, page inspection, navigation, snapshots) +- `$D` commands (design: generate mockups, variants, comparison boards, iterate) +- `codex exec` / `codex review` (outside voice, plan review, adversarial challenge) +- Writing to `~/.gstack/` (config, analytics, review logs, design artifacts, learnings) +- Writing to the plan file (already allowed by plan mode) +- `open` commands for viewing generated artifacts (comparison boards, HTML previews) + +These are read-only in spirit — they inspect the live site, generate visual artifacts, +or get independent opinions. They do NOT modify project source files. + ## Plan Status Footer When you are in plan mode and about to call ExitPlanMode: @@ -525,7 +540,7 @@ Display: - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting). - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. -- **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Adversarial Review (automatic):** Always-on for every review. Every diff gets both Claude adversarial subagent and Codex adversarial challenge. Large diffs (200+ lines) additionally get Codex structured review with P1 gate. No configuration needed. - **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** @@ -1423,6 +1438,41 @@ matches a past learning, display: This makes the compounding visible. The user should see that gstack is getting smarter on their codebase over time. +## Step 3.48: Scope Drift Detection + +Before reviewing code quality, check: **did they build what was requested — nothing more, nothing less?** + +1. Read `TODOS.md` (if it exists). Read PR description (`gh pr view --json body --jq .body 2>/dev/null || true`). + Read commit messages (`git log origin/..HEAD --oneline`). + **If no PR exists:** rely on commit messages and TODOS.md for stated intent — this is the common case since /review runs before /ship creates the PR. +2. Identify the **stated intent** — what was this branch supposed to accomplish? +3. Run `git diff origin/...HEAD --stat` and compare the files changed against the stated intent. + +4. Evaluate with skepticism (incorporating plan completion results if available from an earlier step or adjacent section): + + **SCOPE CREEP detection:** + - Files changed that are unrelated to the stated intent + - New features or refactors not mentioned in the plan + - "While I was in there..." changes that expand blast radius + + **MISSING REQUIREMENTS detection:** + - Requirements from TODOS.md/PR description not addressed in the diff + - Test coverage gaps for stated requirements + - Partial implementations (started but not finished) + +5. Output (before the main review begins): + \`\`\` + Scope Check: [CLEAN / DRIFT DETECTED / REQUIREMENTS MISSING] + Intent: <1-line summary of what was requested> + Delivered: <1-line summary of what the diff actually does> + [If drift: list each out-of-scope change] + [If missing: list each unaddressed requirement] + \`\`\` + +6. This is **INFORMATIONAL** — does not block the review. Proceed to the next step. + +--- + --- ## Step 3.5: Pre-Landing Review @@ -1590,9 +1640,9 @@ For each classified comment: --- -## Step 3.8: Adversarial review (auto-scaled) +## Step 3.8: Adversarial review (always-on) -Adversarial review thoroughness scales automatically based on diff size. No configuration needed. +Every diff gets adversarial review from both Claude and Codex. LOC is not a proxy for risk — a 5-line auth change can be critical. **Detect diff size and tool availability:** @@ -1601,30 +1651,34 @@ DIFF_INS=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ insertion' DIFF_DEL=$(git diff origin/ --stat | tail -1 | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || echo "0") DIFF_TOTAL=$((DIFF_INS + DIFF_DEL)) which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" -# Respect old opt-out +# Legacy opt-out — only gates Codex passes, Claude always runs OLD_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true) echo "DIFF_SIZE: $DIFF_TOTAL" echo "OLD_CFG: ${OLD_CFG:-not_set}" ``` -If `OLD_CFG` is `disabled`: skip this step silently. Continue to the next step. - -**User override:** If the user explicitly requested a specific tier (e.g., "run all passes", "paranoid review", "full adversarial", "do all 4 passes", "thorough review"), honor that request regardless of diff size. Jump to the matching tier section. +If `OLD_CFG` is `disabled`: skip Codex passes only. Claude adversarial subagent still runs (it's free and fast). Jump to the "Claude adversarial subagent" section. -**Auto-select tier based on diff size:** -- **Small (< 50 lines changed):** Skip adversarial review entirely. Print: "Small diff ($DIFF_TOTAL lines) — adversarial review skipped." Continue to the next step. -- **Medium (50–199 lines changed):** Run Codex adversarial challenge (or Claude adversarial subagent if Codex unavailable). Jump to the "Medium tier" section. -- **Large (200+ lines changed):** Run all remaining passes — Codex structured review + Claude adversarial subagent + Codex adversarial. Jump to the "Large tier" section. +**User override:** If the user explicitly requested "full review", "structured review", or "P1 gate", also run the Codex structured review regardless of diff size. --- -### Medium tier (50–199 lines) +### Claude adversarial subagent (always runs) + +Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. + +Subagent prompt: +"Read the diff for this branch with `git diff origin/`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." + +Present findings under an `ADVERSARIAL REVIEW (Claude subagent):` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. -Claude's structured review already ran. Now add a **cross-model adversarial challenge**. +If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing." + +--- -**If Codex is available:** run the Codex adversarial challenge. **If Codex is NOT available:** fall back to the Claude adversarial subagent instead. +### Codex adversarial challenge (always runs when available) -**Codex adversarial:** +If Codex is available AND `OLD_CFG` is NOT `disabled`: ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) @@ -1644,34 +1698,16 @@ Present the full output verbatim. This is informational — it never blocks ship - **Timeout:** "Codex timed out after 5 minutes." - **Empty response:** "Codex returned no response. Stderr: ." -On any Codex error, fall back to the Claude adversarial subagent automatically. +**Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing. -**Claude adversarial subagent** (fallback when Codex unavailable or errored): - -Dispatch via the Agent tool. The subagent has fresh context — no checklist bias from the structured review. This genuine independence catches things the primary reviewer is blind to. - -Subagent prompt: -"Read the diff for this branch with `git diff origin/`. Think like an attacker and a chaos engineer. Your job is to find ways this code will fail in production. Look for: edge cases, race conditions, security holes, resource leaks, failure modes, silent data corruption, logic errors that produce wrong results silently, error handling that swallows failures, and trust boundary violations. Be adversarial. Be thorough. No compliments — just the problems. For each finding, classify as FIXABLE (you know how to fix it) or INVESTIGATE (needs human judgment)." - -Present findings under an `ADVERSARIAL REVIEW (Claude subagent):` header. **FIXABLE findings** flow into the same Fix-First pipeline as the structured review. **INVESTIGATE findings** are presented as informational. - -If the subagent fails or times out: "Claude adversarial subagent unavailable. Continuing without adversarial review." - -**Persist the review result:** -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"medium","commit":"'"$(git rev-parse --short HEAD)"'"}' -``` -Substitute STATUS: "clean" if no findings, "issues_found" if findings exist. SOURCE: "codex" if Codex ran, "claude" if subagent ran. If both failed, do NOT persist. - -**Cleanup:** Run `rm -f "$TMPERR_ADV"` after processing (if Codex was used). +If Codex is NOT available: "Codex CLI not found — running Claude adversarial only. Install Codex for cross-model coverage: `npm install -g @openai/codex`" --- -### Large tier (200+ lines) +### Codex structured review (large diffs only, 200+ lines) -Claude's structured review already ran. Now run **all three remaining passes** for maximum coverage: +If `DIFF_TOTAL >= 200` AND Codex is available AND `OLD_CFG` is NOT `disabled`: -**1. Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } @@ -1692,34 +1728,34 @@ B) Continue — review will still complete If A: address the findings. After fixing, re-run tests (Step 3) since code has changed. Re-run `codex review` to verify. -Read stderr for errors (same error handling as medium tier). +Read stderr for errors (same error handling as Codex adversarial above). After stderr: `rm -f "$TMPERR"` -**2. Claude adversarial subagent:** Dispatch a subagent with the adversarial prompt (same prompt as medium tier). This always runs regardless of Codex availability. +If `DIFF_TOTAL < 200`: skip this section silently. The Claude + Codex adversarial passes provide sufficient coverage for smaller diffs. -**3. Codex adversarial challenge (if available):** Run `codex exec` with the adversarial prompt (same as medium tier). +--- -If Codex is not available for steps 1 and 3, note to the user: "Codex CLI not found — large-diff review ran Claude structured + Claude adversarial (2 of 4 passes). Install Codex for full 4-pass coverage: `npm install -g @openai/codex`" +### Persist the review result -**Persist the review result AFTER all passes complete** (not after each sub-step): +After all passes complete, persist: ```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"large","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"adversarial-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","tier":"always","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}' ``` -Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), or "informational" if Codex was unavailable. If all passes failed, do NOT persist. +Substitute: STATUS = "clean" if no findings across ALL passes, "issues_found" if any pass found issues. SOURCE = "both" if Codex ran, "claude" if only Claude subagent ran. GATE = the Codex structured review gate result ("pass"/"fail"), "skipped" if diff < 200, or "informational" if Codex was unavailable. If all passes failed, do NOT persist. --- -### Cross-model synthesis (medium and large tiers) +### Cross-model synthesis After all passes complete, synthesize findings across all sources: ``` -ADVERSARIAL REVIEW SYNTHESIS (auto: TIER, N lines): +ADVERSARIAL REVIEW SYNTHESIS (always-on, N lines): ════════════════════════════════════════════════════════════ High confidence (found by multiple sources): [findings agreed on by >1 pass] Unique to Claude structured review: [from earlier step] - Unique to Claude adversarial: [from subagent, if ran] + Unique to Claude adversarial: [from subagent] Unique to Codex: [from codex adversarial or code review, if ran] Models used: Claude structured ✓ Claude adversarial ✓/✗ Codex ✓/✗ ════════════════════════════════════════════════════════════ @@ -1976,6 +2012,10 @@ you missed it.> +## Scope Drift + + + ## Plan Completion diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index 993a67a59711ea58abd94dcace66d606f130e834..a20e614a3d1accbe38217e50dc4a44edd3fc6329 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -232,6 +232,8 @@ If multiple suites need to run, run them sequentially (each needs a test lane). {{LEARNINGS_SEARCH}} +{{SCOPE_DRIFT}} + --- ## Step 3.5: Pre-Landing Review @@ -509,6 +511,10 @@ you missed it.> +## Scope Drift + + + ## Plan Completion diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index e2dcb67e95ae8b2b5d361fd120c83dc704fb68e8..91240e235311fd2b45b77c386b37f2a76abb44d0 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -1305,38 +1305,49 @@ describe('Codex skill', () => { expect(content).toContain('mktemp'); }); - test('adversarial review in /review auto-scales by diff size', () => { + test('adversarial review in /review always runs both passes', () => { const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8'); - expect(content).toContain('Adversarial review (auto-scaled)'); - // Diff size thresholds - expect(content).toContain('< 50'); - expect(content).toContain('50–199'); - expect(content).toContain('200+'); - // All three tiers present - expect(content).toContain('Small'); - expect(content).toContain('Medium tier'); - expect(content).toContain('Large tier'); + expect(content).toContain('Adversarial review (always-on)'); + // Always-on: both Claude and Codex adversarial + expect(content).toContain('Claude adversarial subagent (always runs)'); + expect(content).toContain('Codex adversarial challenge (always runs when available)'); // Claude adversarial subagent dispatch expect(content).toContain('Agent tool'); expect(content).toContain('FIXABLE'); expect(content).toContain('INVESTIGATE'); - // Codex fallback logic + // Codex availability check expect(content).toContain('CODEX_NOT_AVAILABLE'); - expect(content).toContain('fall back to the Claude adversarial subagent'); - // Review log uses new skill name + // OLD_CFG only gates Codex, not Claude + expect(content).toContain('skip Codex passes only'); + // Review log expect(content).toContain('adversarial-review'); expect(content).toContain('reasoning_effort="high"'); expect(content).toContain('ADVERSARIAL REVIEW SYNTHESIS'); + // Large diff structured review still gated + expect(content).toContain('Codex structured review (large diffs only'); + expect(content).toContain('200'); }); - test('adversarial review in /ship auto-scales by diff size', () => { + test('adversarial review in /ship always runs both passes', () => { const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); - expect(content).toContain('Adversarial review (auto-scaled)'); - expect(content).toContain('< 50'); - expect(content).toContain('200+'); + expect(content).toContain('Adversarial review (always-on)'); expect(content).toContain('adversarial-review'); expect(content).toContain('reasoning_effort="high"'); expect(content).toContain('Investigate and fix'); + expect(content).toContain('Claude adversarial subagent (always runs)'); + }); + + test('scope drift detection in /review and /ship', () => { + const reviewContent = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8'); + const shipContent = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + // Both should contain scope drift from the shared resolver + for (const content of [reviewContent, shipContent]) { + expect(content).toContain('Scope Check:'); + expect(content).toContain('DRIFT DETECTED'); + expect(content).toContain('SCOPE CREEP'); + expect(content).toContain('MISSING REQUIREMENTS'); + expect(content).toContain('stated intent'); + } }); test('codex-host ship/review do NOT contain adversarial review step', () => {