From 7fbf68bb3f253ef799f38862fb1cb0b62d756a64 Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Sun, 22 Mar 2026 21:47:10 -0700 Subject: [PATCH] feat: cross-model outside voice in plan reviews (v0.9.9.1) (#326) * feat: add generateCodexPlanReview() resolver for cross-model plan review New resolver offers an optional Codex (or Claude subagent fallback) "outside voice" after plan review sections complete. Includes cross-model tension detection with auto-TODO proposals, review log persistence, and an Outside Voice row in the Review Readiness Dashboard. Co-Authored-By: Claude Opus 4.6 (1M context) * feat: integrate {{CODEX_PLAN_REVIEW}} into CEO and eng review templates CEO review: insert after Section 11 + add Outside Voice summary row. Eng review: replace hardcoded Step 0.5 with resolver (adds fallback, logging, dashboard, xhigh reasoning, cross-model tension tracking). Co-Authored-By: Claude Opus 4.6 (1M context) * chore: regenerate SKILL.md files from updated templates Co-Authored-By: Claude Opus 4.6 (1M context) * docs: update project documentation for v0.9.9.1 ARCHITECTURE.md: added {{CODEX_PLAN_REVIEW}} to placeholder table. CHANGELOG.md: added v0.9.9.1 entry for outside voice feature. VERSION: bumped to 0.9.9.1. Co-Authored-By: Claude Opus 4.6 (1M context) * fix: move {{CODEX_PLAN_REVIEW}} after review sections in eng review Codex adversarial review caught that the placeholder was positioned before the 4 review sections, so the "After all review sections are complete" instruction could confuse the model. Moved it after Section 4's STOP directive where it belongs. Co-Authored-By: Claude Opus 4.6 (1M context) * chore: regenerate eng review SKILL.md files Co-Authored-By: Claude Opus 4.6 (1M context) --------- Co-authored-by: Claude Opus 4.6 (1M context) --- ARCHITECTURE.md | 1 + CHANGELOG.md | 12 +++ VERSION | 2 +- plan-ceo-review/SKILL.md | 124 +++++++++++++++++++++++++++- plan-ceo-review/SKILL.md.tmpl | 3 + plan-design-review/SKILL.md | 4 +- plan-eng-review/SKILL.md | 147 ++++++++++++++++++++++++++++------ plan-eng-review/SKILL.md.tmpl | 26 +----- scripts/gen-skill-docs.ts | 129 ++++++++++++++++++++++++++++- ship/SKILL.md | 4 +- 10 files changed, 400 insertions(+), 52 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index b6f4541d395392599e60678db13914b787c57a5b..8ffc16aaecd2d206646ccc915cc06f2766cb6aad 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -205,6 +205,7 @@ Templates contain the workflows, tips, and examples that require human judgment. | `{{DESIGN_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared design audit methodology for /plan-design-review and /design-review | | `{{REVIEW_DASHBOARD}}` | `gen-skill-docs.ts` | Review Readiness Dashboard for /ship pre-flight | | `{{TEST_BOOTSTRAP}}` | `gen-skill-docs.ts` | Test framework detection, bootstrap, CI/CD setup for /qa, /ship, /design-review | +| `{{CODEX_PLAN_REVIEW}}` | `gen-skill-docs.ts` | Optional cross-model plan review (Codex or Claude subagent fallback) for /plan-ceo-review and /plan-eng-review | This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear. diff --git a/CHANGELOG.md b/CHANGELOG.md index 4782dd6a01f48e6a3506b27f5ae95af0573512fa..78ac406f682f906e1ddfbb12f2126cb7fe12c577 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,17 @@ # Changelog +## [0.11.5.2] - 2026-03-22 — Outside Voice + +### Added + +- **Plan reviews now offer an independent second opinion.** After all review sections complete in `/plan-ceo-review` or `/plan-eng-review`, you can get a "brutally honest outside voice" from a different AI model (Codex CLI, or a fresh Claude subagent if Codex isn't installed). It reads your plan, finds what the review missed — logical gaps, unstated assumptions, feasibility risks — and presents findings verbatim. Optional, recommended, never blocks shipping. +- **Cross-model tension detection.** When the outside voice disagrees with the review findings, the disagreements are surfaced automatically and offered as TODOs so nothing gets lost. +- **Outside Voice in the Review Readiness Dashboard.** `/ship` now shows whether an outside voice ran on the plan, alongside the existing CEO/Eng/Design/Adversarial review rows. + +### Changed + +- **`/plan-eng-review` Codex integration upgraded.** The old hardcoded Step 0.5 is replaced with a richer resolver that adds Claude subagent fallback, review log persistence, dashboard visibility, and higher reasoning effort (`xhigh`). + ## [0.11.5.1] - 2026-03-23 — Inline Office Hours ### Changed diff --git a/VERSION b/VERSION index af928090b40ebc155e96304520ccfd919d7837fc..03cdd3dfd536a4a135624e1132aec8908eb5c59a 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.11.5.1 +0.11.5.2 diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index a0a64d837f8ca535893e1d3ce8ce719280a032db..f566a735e2b77317fbfd9e3b9b8447f3c478ba74 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -990,6 +990,125 @@ Required ASCII diagram: user flow showing screens/states and transitions. If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation." **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. +## Outside Voice — Independent Plan Challenge (optional, recommended) + +After all review sections are complete, offer an independent second opinion from a +different AI system. Two models agreeing on a plan is stronger signal than one model's +thorough review. + +**Check tool availability:** + +```bash +which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +``` + +Use AskUserQuestion: + +> "All review sections are complete. Want an outside voice? A different AI system can +> give a brutally honest, independent challenge of this plan — logical gaps, feasibility +> risks, and blind spots that are hard to catch from inside the review. Takes about 2 +> minutes." +> +> RECOMMENDATION: Choose A — an independent second opinion catches structural blind +> spots. Two different AI models agreeing on a plan is stronger signal than one model's +> thorough review. Completeness: A=9/10, B=7/10. + +Options: +- A) Get the outside voice (recommended) +- B) Skip — proceed to outputs + +**If B:** Print "Skipping outside voice." and continue to the next section. + +**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file +the user pointed this review at, or the branch diff scope). If a CEO plan document +was written in Step 0D-POST, read that too — it contains the scope decisions and vision. + +Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB, +truncate to the first 30KB and note "Plan truncated for size"): + +"You are a brutally honest technical reviewer examining a development plan that has +already been through a multi-section review. Your job is NOT to repeat that review. +Instead, find what it missed. Look for: logical gaps and unstated assumptions that +survived the review scrutiny, overcomplexity (is there a fundamentally simpler +approach the review was too deep in the weeds to see?), feasibility risks the review +took for granted, missing dependencies or sequencing issues, and strategic +miscalibration (is this the right thing to build at all?). Be direct. Be terse. No +compliments. Just the problems. + +THE PLAN: +" + +**If CODEX_AVAILABLE:** + +```bash +TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) +codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +``` + +Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: +```bash +cat "$TMPERR_PV" +``` + +Present the full output verbatim: + +``` +CODEX SAYS (plan review — outside voice): +════════════════════════════════════════════════════════════ + +════════════════════════════════════════════════════════════ +``` + +**Error handling:** All errors are non-blocking — the outside voice is informational. +- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \`codex login\` to authenticate." +- Timeout: "Codex timed out after 5 minutes." +- Empty response: "Codex returned no response." + +On any Codex error, fall back to the Claude adversarial subagent. + +**If CODEX_NOT_AVAILABLE (or Codex errored):** + +Dispatch via the Agent tool. The subagent has fresh context — genuine independence. + +Subagent prompt: same plan review prompt as above. + +Present findings under an `OUTSIDE VOICE (Claude subagent):` header. + +If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs." + +**Cross-model tension:** + +After presenting the outside voice findings, note any points where the outside voice +disagrees with the review findings from earlier sections. Flag these as: + +``` +CROSS-MODEL TENSION: + [Topic]: Review said X. Outside voice says Y. [Your assessment of who's right.] +``` + +For each substantive tension point, auto-propose as a TODO via AskUserQuestion: + +> "Cross-model disagreement on [topic]. The review found [X] but the outside voice +> argues [Y]. Worth investigating further?" + +Options: +- A) Add to TODOS.md +- B) Skip — not substantive + +If no tension points exist, note: "No cross-model tension — both reviewers agree." + +**Persist the result:** +```bash +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}' +``` + +Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist. +SOURCE = "codex" if Codex ran, "claude" if subagent ran. + +**Cleanup:** Run `rm -f "$TMPERR_PV"` after processing (if Codex was used). + +--- + ## Post-Implementation Design Audit (if UI scope detected) After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output. @@ -1084,6 +1203,7 @@ List every ASCII diagram in files this plan touches. Still accurate? | TODOS.md updates | ___ items proposed | | Scope proposals | ___ proposed, ___ accepted (EXP + SEL) | | CEO plan | written / skipped (HOLD/REDUCTION) | + | Outside voice | ran (codex/claude) / skipped | | Lake Score | X/Y recommendations chose complete option | | Diagrams produced | ___ (list types) | | Stale diagrams found | ___ | @@ -1137,7 +1257,7 @@ After completing the review, read the review log and config to display the dashb ~/.claude/skills/gstack/bin/gstack-review-read ``` -Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: +Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: ``` +====================================================================+ @@ -1149,6 +1269,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl | CEO Review | 0 | — | — | no | | Design Review | 0 | — | — | no | | Adversarial | 0 | — | — | no | +| Outside Voice | 0 | — | — | no | +--------------------------------------------------------------------+ | VERDICT: CLEARED — Eng Review passed | +====================================================================+ @@ -1159,6 +1280,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. - **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl index 1c49b0197080c2cc4bf51141a7c2e88381010506..945fcaa6a31a92a1ca1c497d666966d3af54bbc5 100644 --- a/plan-ceo-review/SKILL.md.tmpl +++ b/plan-ceo-review/SKILL.md.tmpl @@ -589,6 +589,8 @@ Required ASCII diagram: user flow showing screens/states and transitions. If this plan has significant UI scope, recommend: "Consider running /plan-design-review for a deep design review of this plan before implementation." **STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds. +{{CODEX_PLAN_REVIEW}} + ## Post-Implementation Design Audit (if UI scope detected) After implementation, run `/design-review` on the live site to catch visual issues that can only be evaluated with rendered output. @@ -683,6 +685,7 @@ List every ASCII diagram in files this plan touches. Still accurate? | TODOS.md updates | ___ items proposed | | Scope proposals | ___ proposed, ___ accepted (EXP + SEL) | | CEO plan | written / skipped (HOLD/REDUCTION) | + | Outside voice | ran (codex/claude) / skipped | | Lake Score | X/Y recommendations chose complete option | | Diagrams produced | ___ (list types) | | Stale diagrams found | ___ | diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index fb0bbfa2fede5c6e78f046de3928ff775ff6e1bf..57d5c6355cfcb8f5d3c6844c450e55d90701de25 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -763,7 +763,7 @@ After completing the review, read the review log and config to display the dashb ~/.claude/skills/gstack/bin/gstack-review-read ``` -Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: +Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: ``` +====================================================================+ @@ -775,6 +775,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl | CEO Review | 0 | — | — | no | | Design Review | 0 | — | — | no | | Adversarial | 0 | — | — | no | +| Outside Voice | 0 | — | — | no | +--------------------------------------------------------------------+ | VERDICT: CLEARED — Eng Review passed | +====================================================================+ @@ -785,6 +786,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. - **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index a9bb39b4a26ab32bab25c23148fac64b54dcd293..1e86eacee42c961613a479e246a1cbc050fb269d 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -420,29 +420,6 @@ Before reviewing anything, answer these questions: If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. -### Step 0.5: Codex plan review (optional) - -Check if the Codex CLI is available: `which codex 2>/dev/null` - -If available, after presenting Step 0 findings, use AskUserQuestion: -``` -Want an independent Codex (OpenAI) review of this plan before the detailed review? -A) Yes — let Codex critique the plan independently -B) No — proceed with the Claude review only -``` - -If the user chooses A: tell Codex to read the plan file itself (avoids ARG_MAX limits for large plans): -```bash -codex exec "You are a brutally honest technical reviewer. Read the plan file at and review it for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached -``` - -Replace `` with the actual path to the plan file detected earlier. Codex has filesystem access in read-only mode and will read the file itself. - -Present the full output under a `CODEX SAYS (plan review):` header. Note any concerns -that should inform the subsequent engineering review sections. - -If Codex is not available, skip silently. - Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. **Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components. @@ -684,6 +661,125 @@ Evaluate: **STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +## Outside Voice — Independent Plan Challenge (optional, recommended) + +After all review sections are complete, offer an independent second opinion from a +different AI system. Two models agreeing on a plan is stronger signal than one model's +thorough review. + +**Check tool availability:** + +```bash +which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +``` + +Use AskUserQuestion: + +> "All review sections are complete. Want an outside voice? A different AI system can +> give a brutally honest, independent challenge of this plan — logical gaps, feasibility +> risks, and blind spots that are hard to catch from inside the review. Takes about 2 +> minutes." +> +> RECOMMENDATION: Choose A — an independent second opinion catches structural blind +> spots. Two different AI models agreeing on a plan is stronger signal than one model's +> thorough review. Completeness: A=9/10, B=7/10. + +Options: +- A) Get the outside voice (recommended) +- B) Skip — proceed to outputs + +**If B:** Print "Skipping outside voice." and continue to the next section. + +**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file +the user pointed this review at, or the branch diff scope). If a CEO plan document +was written in Step 0D-POST, read that too — it contains the scope decisions and vision. + +Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB, +truncate to the first 30KB and note "Plan truncated for size"): + +"You are a brutally honest technical reviewer examining a development plan that has +already been through a multi-section review. Your job is NOT to repeat that review. +Instead, find what it missed. Look for: logical gaps and unstated assumptions that +survived the review scrutiny, overcomplexity (is there a fundamentally simpler +approach the review was too deep in the weeds to see?), feasibility risks the review +took for granted, missing dependencies or sequencing issues, and strategic +miscalibration (is this the right thing to build at all?). Be direct. Be terse. No +compliments. Just the problems. + +THE PLAN: +" + +**If CODEX_AVAILABLE:** + +```bash +TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) +codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +``` + +Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: +```bash +cat "$TMPERR_PV" +``` + +Present the full output verbatim: + +``` +CODEX SAYS (plan review — outside voice): +════════════════════════════════════════════════════════════ + +════════════════════════════════════════════════════════════ +``` + +**Error handling:** All errors are non-blocking — the outside voice is informational. +- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \`codex login\` to authenticate." +- Timeout: "Codex timed out after 5 minutes." +- Empty response: "Codex returned no response." + +On any Codex error, fall back to the Claude adversarial subagent. + +**If CODEX_NOT_AVAILABLE (or Codex errored):** + +Dispatch via the Agent tool. The subagent has fresh context — genuine independence. + +Subagent prompt: same plan review prompt as above. + +Present findings under an `OUTSIDE VOICE (Claude subagent):` header. + +If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs." + +**Cross-model tension:** + +After presenting the outside voice findings, note any points where the outside voice +disagrees with the review findings from earlier sections. Flag these as: + +``` +CROSS-MODEL TENSION: + [Topic]: Review said X. Outside voice says Y. [Your assessment of who's right.] +``` + +For each substantive tension point, auto-propose as a TODO via AskUserQuestion: + +> "Cross-model disagreement on [topic]. The review found [X] but the outside voice +> argues [Y]. Worth investigating further?" + +Options: +- A) Add to TODOS.md +- B) Skip — not substantive + +If no tension points exist, note: "No cross-model tension — both reviewers agree." + +**Persist the result:** +```bash +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}' +``` + +Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist. +SOURCE = "codex" if Codex ran, "claude" if subagent ran. + +**Cleanup:** Run `rm -f "$TMPERR_PV"` after processing (if Codex was used). + +--- + ## CRITICAL RULE — How to ask questions Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews: * **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. @@ -739,6 +835,7 @@ At the end of the review, fill in and display this summary so the user can see a - What already exists: written - TODOS.md updates: ___ items proposed to user - Failure modes: ___ critical gaps flagged +- Outside voice: ran (codex/claude) / skipped - Lake Score: X/Y recommendations chose complete option ## Retrospective learning @@ -781,7 +878,7 @@ After completing the review, read the review log and config to display the dashb ~/.claude/skills/gstack/bin/gstack-review-read ``` -Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: +Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: ``` +====================================================================+ @@ -793,6 +890,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl | CEO Review | 0 | — | — | no | | Design Review | 0 | — | — | no | | Adversarial | 0 | — | — | no | +| Outside Voice | 0 | — | — | no | +--------------------------------------------------------------------+ | VERDICT: CLEARED — Eng Review passed | +====================================================================+ @@ -803,6 +901,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. - **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`) diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index c9ebde44d57522107736a3715dbdd97a8a00f651..44d64a0e802e4e4f5bca0a32f8cffa2761e0985d 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -96,29 +96,6 @@ Before reviewing anything, answer these questions: If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. -### Step 0.5: Codex plan review (optional) - -Check if the Codex CLI is available: `which codex 2>/dev/null` - -If available, after presenting Step 0 findings, use AskUserQuestion: -``` -Want an independent Codex (OpenAI) review of this plan before the detailed review? -A) Yes — let Codex critique the plan independently -B) No — proceed with the Claude review only -``` - -If the user chooses A: tell Codex to read the plan file itself (avoids ARG_MAX limits for large plans): -```bash -codex exec "You are a brutally honest technical reviewer. Read the plan file at and review it for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached -``` - -Replace `` with the actual path to the plan file detected earlier. Codex has filesystem access in read-only mode and will read the file itself. - -Present the full output under a `CODEX SAYS (plan review):` header. Note any concerns -that should inform the subsequent engineering review sections. - -If Codex is not available, skip silently. - Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. **Critical: Once the user accepts or rejects a scope reduction recommendation, commit fully.** Do not re-argue for smaller scope during later review sections. Do not silently reduce scope or skip planned components. @@ -165,6 +142,8 @@ Evaluate: **STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +{{CODEX_PLAN_REVIEW}} + ## CRITICAL RULE — How to ask questions Follow the AskUserQuestion format from the Preamble above. Additional rules for plan reviews: * **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. @@ -220,6 +199,7 @@ At the end of the review, fill in and display this summary so the user can see a - What already exists: written - TODOS.md updates: ___ items proposed to user - Failure modes: ___ critical gaps flagged +- Outside voice: ran (codex/claude) / skipped - Lake Score: X/Y recommendations chose complete option ## Retrospective learning diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index 33906183dec2e4643813c2615c9bf4debe6e62f7..7d99d0de526495638087abe6cc9190978de82c8a 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -1311,7 +1311,7 @@ After completing the review, read the review log and config to display the dashb ~/.claude/skills/gstack/bin/gstack-review-read \`\`\` -Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between \`adversarial-review\` (new auto-scaled) and \`codex-review\` (legacy). For Design Review, show whichever is more recent between \`plan-design-review\` (full visual audit) and \`design-review-lite\` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: +Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between \`adversarial-review\` (new auto-scaled) and \`codex-review\` (legacy). For Design Review, show whichever is more recent between \`plan-design-review\` (full visual audit) and \`design-review-lite\` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: \`\`\` +====================================================================+ @@ -1323,6 +1323,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl | CEO Review | 0 | — | — | no | | Design Review | 0 | — | — | no | | Adversarial | 0 | — | — | no | +| Outside Voice | 0 | — | — | no | +--------------------------------------------------------------------+ | VERDICT: CLEARED — Eng Review passed | +====================================================================+ @@ -1333,6 +1334,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. - **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`) @@ -2380,6 +2382,130 @@ High-confidence findings (agreed on by multiple sources) should be prioritized f ---`; } +function generateCodexPlanReview(ctx: TemplateContext): string { + // Codex host: strip entirely — Codex should never invoke itself + if (ctx.host === 'codex') return ''; + + return `## Outside Voice — Independent Plan Challenge (optional, recommended) + +After all review sections are complete, offer an independent second opinion from a +different AI system. Two models agreeing on a plan is stronger signal than one model's +thorough review. + +**Check tool availability:** + +\`\`\`bash +which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" +\`\`\` + +Use AskUserQuestion: + +> "All review sections are complete. Want an outside voice? A different AI system can +> give a brutally honest, independent challenge of this plan — logical gaps, feasibility +> risks, and blind spots that are hard to catch from inside the review. Takes about 2 +> minutes." +> +> RECOMMENDATION: Choose A — an independent second opinion catches structural blind +> spots. Two different AI models agreeing on a plan is stronger signal than one model's +> thorough review. Completeness: A=9/10, B=7/10. + +Options: +- A) Get the outside voice (recommended) +- B) Skip — proceed to outputs + +**If B:** Print "Skipping outside voice." and continue to the next section. + +**If A:** Construct the plan review prompt. Read the plan file being reviewed (the file +the user pointed this review at, or the branch diff scope). If a CEO plan document +was written in Step 0D-POST, read that too — it contains the scope decisions and vision. + +Construct this prompt (substitute the actual plan content — if plan content exceeds 30KB, +truncate to the first 30KB and note "Plan truncated for size"): + +"You are a brutally honest technical reviewer examining a development plan that has +already been through a multi-section review. Your job is NOT to repeat that review. +Instead, find what it missed. Look for: logical gaps and unstated assumptions that +survived the review scrutiny, overcomplexity (is there a fundamentally simpler +approach the review was too deep in the weeds to see?), feasibility risks the review +took for granted, missing dependencies or sequencing issues, and strategic +miscalibration (is this the right thing to build at all?). Be direct. Be terse. No +compliments. Just the problems. + +THE PLAN: +" + +**If CODEX_AVAILABLE:** + +\`\`\`bash +TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) +codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +\`\`\` + +Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: +\`\`\`bash +cat "$TMPERR_PV" +\`\`\` + +Present the full output verbatim: + +\`\`\` +CODEX SAYS (plan review — outside voice): +════════════════════════════════════════════════════════════ + +════════════════════════════════════════════════════════════ +\`\`\` + +**Error handling:** All errors are non-blocking — the outside voice is informational. +- Auth failure (stderr contains "auth", "login", "unauthorized"): "Codex auth failed. Run \\\`codex login\\\` to authenticate." +- Timeout: "Codex timed out after 5 minutes." +- Empty response: "Codex returned no response." + +On any Codex error, fall back to the Claude adversarial subagent. + +**If CODEX_NOT_AVAILABLE (or Codex errored):** + +Dispatch via the Agent tool. The subagent has fresh context — genuine independence. + +Subagent prompt: same plan review prompt as above. + +Present findings under an \`OUTSIDE VOICE (Claude subagent):\` header. + +If the subagent fails or times out: "Outside voice unavailable. Continuing to outputs." + +**Cross-model tension:** + +After presenting the outside voice findings, note any points where the outside voice +disagrees with the review findings from earlier sections. Flag these as: + +\`\`\` +CROSS-MODEL TENSION: + [Topic]: Review said X. Outside voice says Y. [Your assessment of who's right.] +\`\`\` + +For each substantive tension point, auto-propose as a TODO via AskUserQuestion: + +> "Cross-model disagreement on [topic]. The review found [X] but the outside voice +> argues [Y]. Worth investigating further?" + +Options: +- A) Add to TODOS.md +- B) Skip — not substantive + +If no tension points exist, note: "No cross-model tension — both reviewers agree." + +**Persist the result:** +\`\`\`bash +~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-plan-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}' +\`\`\` + +Substitute: STATUS = "clean" if no findings, "issues_found" if findings exist. +SOURCE = "codex" if Codex ran, "claude" if subagent ran. + +**Cleanup:** Run \`rm -f "$TMPERR_PV"\` after processing (if Codex was used). + +---`; +} + function generateDeployBootstrap(_ctx: TemplateContext): string { return `\`\`\`bash # Check for persisted deploy config in CLAUDE.md @@ -2696,6 +2822,7 @@ const RESOLVERS: Record string> = { CODEX_REVIEW_STEP: generateAdversarialStep, ADVERSARIAL_STEP: generateAdversarialStep, DEPLOY_BOOTSTRAP: generateDeployBootstrap, + CODEX_PLAN_REVIEW: generateCodexPlanReview, }; // ─── Codex Helpers ─────────────────────────────────────────── diff --git a/ship/SKILL.md b/ship/SKILL.md index ce3196eaaee53d3e4f92780271d44895ba82a0d9..f5920c9c2d15b3b66ff860358a3bf45625f1a360 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -356,7 +356,7 @@ After completing the review, read the review log and config to display the dashb ~/.claude/skills/gstack/bin/gstack-review-read ``` -Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: +Parse the output. Find the most recent entry for each skill (plan-ceo-review, plan-eng-review, plan-design-review, design-review-lite, adversarial-review, codex-review, codex-plan-review). Ignore entries with timestamps older than 7 days. For the Adversarial row, show whichever is more recent between `adversarial-review` (new auto-scaled) and `codex-review` (legacy). For Design Review, show whichever is more recent between `plan-design-review` (full visual audit) and `design-review-lite` (code-level check). Append "(FULL)" or "(LITE)" to the status to distinguish. Display: ``` +====================================================================+ @@ -368,6 +368,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl | CEO Review | 0 | — | — | no | | Design Review | 0 | — | — | no | | Adversarial | 0 | — | — | no | +| Outside Voice | 0 | — | — | no | +--------------------------------------------------------------------+ | VERDICT: CLEARED — Eng Review passed | +====================================================================+ @@ -378,6 +379,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup. - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes. - **Adversarial Review (automatic):** Auto-scales by diff size. Small diffs (<50 lines) skip adversarial. Medium diffs (50–199) get cross-model adversarial. Large diffs (200+) get all 4 passes: Claude structured, Codex structured, Claude adversarial subagent, Codex adversarial. No configuration needed. +- **Outside Voice (optional):** Independent plan review from a different AI model. Offered after all review sections complete in /plan-ceo-review and /plan-eng-review. Falls back to Claude subagent if Codex is unavailable. Never gates shipping. **Verdict logic:** - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)