~cytrogen/gstack

2c5ae38542fa7f1137fbc3b07b8b59e002968ff0 — Garry Tan a month ago 6f1bdb6
feat: triple-voice multi-model review in /autoplan (v0.11.12.0) (#424)

* feat: triple-voice multi-model review in /autoplan

Add dual voices (Codex + Claude subagent) to all 3 autoplan phases with
phase-tailored prompts, consensus tables, and cross-phase synthesis.

- Phase 1 (CEO): strategy-focused Codex + independent Claude subagent
- Phase 2 (Design): UX-focused voices with CEO findings cascaded to Codex only
- Phase 3 (Eng): architecture voices with CEO+Design findings cascaded to Codex
- Sequential enforcement: STOP markers + pre-phase checklists
- Phase-transition summaries between phases
- Cross-phase themes in Phase 4 gate
- Degradation matrix for voice failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.11.12.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
4 files changed, 455 insertions(+), 13 deletions(-)

M CHANGELOG.md
M VERSION
M autoplan/SKILL.md
M autoplan/SKILL.md.tmpl
M CHANGELOG.md => CHANGELOG.md +14 -0
@@ 1,5 1,19 @@
# Changelog

## [0.11.12.0] - 2026-03-24 — Triple-Voice Autoplan

Every `/autoplan` phase now gets two independent second opinions — one from Codex (OpenAI's frontier model) and one from a fresh Claude subagent. Three AI reviewers looking at your plan from different angles, each phase building on the last.

### Added

- **Dual voices in every autoplan phase.** CEO review, Design review, and Eng review each run both a Codex challenge and an independent Claude subagent simultaneously. You get a consensus table showing where the models agree and disagree — disagreements surface as taste decisions at the final gate.
- **Phase-cascading context.** Codex gets prior-phase findings as context (CEO concerns inform Design review, CEO+Design inform Eng). Claude subagent stays truly independent for genuine cross-model validation.
- **Structured consensus tables.** CEO phase scores 6 strategic dimensions, Design uses the litmus scorecard, Eng scores 6 architecture dimensions. CONFIRMED/DISAGREE for each.
- **Cross-phase synthesis.** Phase 4 gate highlights themes that appeared independently in multiple phases — high-confidence signals when different reviewers catch the same issue.
- **Sequential enforcement.** STOP markers between phases + pre-phase checklists prevent autoplan from accidentally parallelizing CEO/Design/Eng (each phase depends on the previous).
- **Phase-transition summaries.** Brief status at each phase boundary so you can track progress without waiting for the full pipeline.
- **Degradation matrix.** When Codex or the Claude subagent fails, autoplan gracefully degrades with clear labels (`[codex-only]`, `[subagent-only]`, `[single-reviewer mode]`).

## [0.11.11.0] - 2026-03-23 — Community Wave 3

10 community PRs merged — bug fixes, platform support, and workflow improvements.

M VERSION => VERSION +1 -1
@@ 1,1 1,1 @@
0.11.11.0
0.11.12.0

M autoplan/SKILL.md => autoplan/SKILL.md +220 -6
@@ 415,6 415,17 @@ Examples: run codex (always yes), run evals (always yes), reduce scope on a comp

---

## Sequential Execution — MANDATORY

Phases MUST execute in strict order: CEO → Design → Eng.
Each phase MUST complete fully before the next begins.
NEVER run phases in parallel — each builds on the previous.

Between each phase, emit a phase-transition summary and verify that all required
outputs from the prior phase are written before starting the next.

---

## What "Auto-Decide" Means

Auto-decide replaces the USER'S judgment with the 6 principles. It does NOT replace


@@ 500,6 511,8 @@ Read each file using the Read tool:
- Review Readiness Dashboard
- Plan File Review Report
- Prerequisite Skill Offer (BENEFITS_FROM)
- Outside Voice — Independent Plan Challenge
- Design Outside Voices (parallel)

Follow ONLY the review-specific methodology, sections, and required outputs.



@@ 523,6 536,38 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
- Scope expansion: in blast radius + <1d CC → approve (P2). Outside → defer to TODOS.md (P3).
  Duplicates → reject (P4). Borderline (3-5 files) → mark TASTE DECISION.
- All 10 review sections: run fully, auto-decide each issue, log every decision.
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).
  Run them simultaneously (Agent tool for subagent, Bash for Codex).

  **Codex CEO voice** (via Bash):
  Command: `codex exec "You are a CEO/founder advisor reviewing a development plan.
  Challenge the strategic foundations: Are the premises valid or assumed? Is this the
  right problem to solve, or is there a reframing that would be 10x more impactful?
  What alternatives were dismissed too quickly? What competitive or market risks are
  unaddressed? What scope decisions will look foolish in 6 months? Be adversarial.
  No compliments. Just the strategic blind spots.
  File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude CEO subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent CEO/strategist
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Is this the right problem to solve? Could a reframing yield 10x impact?
  2. Are the premises stated or just assumed? Which ones could be wrong?
  3. What's the 6-month regret scenario — what will look foolish?
  4. What alternatives were dismissed without sufficient analysis?
  5. What's the competitive risk — could someone else solve this first/better?
  For each finding: what's wrong, severity (critical/high/medium), and the fix."

  **Error handling:** All non-blocking. Codex auth/timeout/empty → proceed with
  Claude subagent only, tagged `[single-model]`. If Claude subagent also fails →
  "Outside voices unavailable — continuing with primary review."

  **Degradation matrix:** Both fail → "single-reviewer mode". Codex only →
  tag `[codex-only]`. Subagent only → tag `[subagent-only]`.

- Strategy choices: if codex disagrees with a premise or scope decision with valid
  strategic reason → TASTE DECISION.

**Required execution checklist (CEO):**



@@ 535,6 580,27 @@ Step 0 (0A-0F) — run each sub-step and produce:
- 0E: Temporal interrogation (HOUR 1 → HOUR 6+)
- 0F: Mode selection confirmation

Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present
Codex output under CODEX SAYS (CEO — strategy challenge) header. Present subagent
output under CLAUDE SUBAGENT (CEO — strategic independence) header. Produce CEO
consensus table:

```
CEO DUAL VOICES — CONSENSUS TABLE:
═══════════════════════════════════════════════════════════════
  Dimension                           Claude  Codex  Consensus
  ──────────────────────────────────── ─────── ─────── ─────────
  1. Premises valid?                   —       —      —
  2. Right problem to solve?           —       —      —
  3. Scope calibration correct?        —       —      —
  4. Alternatives sufficiently explored?—      —      —
  5. Competitive/market risks covered? —       —      —
  6. 6-month trajectory sound?         —       —      —
═══════════════════════════════════════════════════════════════
CONFIRMED = both agree. DISAGREE = models differ (→ taste decision).
Missing voice = N/A (not CONFIRMED). Single critical finding from one voice = flagged regardless.
```

Sections 1-10 — for EACH section, run the evaluation criteria from the loaded skill file:
- Sections WITH findings: full analysis, auto-decide each issue, log to audit trail
- Sections with NO findings: 1-2 sentences stating what was examined and why nothing


@@ 549,8 615,23 @@ Sections 1-10 — for EACH section, run the evaluation criteria from the loaded 
- Dream state delta (where this plan leaves us vs 12-month ideal)
- Completion Summary (the full summary table from the CEO skill)

**PHASE 1 COMPLETE.** Emit phase-transition summary:
> **Phase 1 complete.** Codex: [N concerns]. Claude subagent: [N issues].
> Consensus: [X/6 confirmed, Y disagreements → surfaced at gate].
> Passing to Phase 2.

Do NOT begin Phase 2 until all Phase 1 outputs are written to the plan file
and the premise gate has been passed.

---

**Pre-Phase 2 checklist (verify before starting):**
- [ ] CEO completion summary written to plan file
- [ ] CEO dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] CEO consensus table produced
- [ ] Premise gate passed (user confirmed)
- [ ] Phase-transition summary emitted

## Phase 2: Design Review (conditional — skip if no UI scope)

Follow plan-design-review/SKILL.md — all 7 dimensions, full depth.


@@ 561,19 642,102 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
- Structural issues (missing states, broken hierarchy): auto-fix (P5)
- Aesthetic/taste issues: mark TASTE DECISION
- Design system alignment: auto-fix if DESIGN.md exists and fix is obvious
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).

  **Codex design voice** (via Bash):
  Command: `codex exec "Read the plan file at <plan_path>. Evaluate this plan's
  UI/UX design decisions.

  Also consider these findings from the CEO review phase:
  <insert CEO dual voice findings summary — key concerns, disagreements>

  Does the information hierarchy serve the user or the developer? Are interaction
  states (loading, empty, error, partial) specified or left to the implementer's
  imagination? Is the responsive strategy intentional or afterthought? Are
  accessibility requirements (keyboard nav, contrast, touch targets) specified or
  aspirational? Does the plan describe specific UI decisions or generic patterns?
  What design decisions will haunt the implementer if left ambiguous?
  Be opinionated. No hedging." -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude design subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent senior product designer
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Information hierarchy: what does the user see first, second, third? Is it right?
  2. Missing states: loading, empty, error, success, partial — which are unspecified?
  3. User journey: what's the emotional arc? Where does it break?
  4. Specificity: does the plan describe SPECIFIC UI or generic patterns?
  5. What design decisions will haunt the implementer if left ambiguous?
  For each finding: what's wrong, severity (critical/high/medium), and the fix."
  NO prior-phase context — subagent must be truly independent.

  Error handling: same as Phase 1 (non-blocking, degradation matrix applies).

- Design choices: if codex disagrees with a design decision with valid UX reasoning
  → TASTE DECISION.

**Required execution checklist (Design):**

1. Step 0 (Design Scope): Rate completeness 0-10. Check DESIGN.md. Map existing patterns.

2. Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present under
   CODEX SAYS (design — UX challenge) and CLAUDE SUBAGENT (design — independent review)
   headers. Produce design litmus scorecard (consensus table). Use the litmus scorecard
   format from plan-design-review. Include CEO phase findings in Codex prompt ONLY
   (not Claude subagent — stays independent).

3. Passes 1-7: Run each from loaded skill. Rate 0-10. Auto-decide each issue.
   DISAGREE items from scorecard → raised in the relevant pass with both perspectives.

**PHASE 2 COMPLETE.** Emit phase-transition summary:
> **Phase 2 complete.** Codex: [N concerns]. Claude subagent: [N issues].
> Consensus: [X/Y confirmed, Z disagreements → surfaced at gate].
> Passing to Phase 3.

Do NOT begin Phase 3 until all Phase 2 outputs (if run) are written to the plan file.

---

## Phase 3: Eng Review + Codex
**Pre-Phase 3 checklist (verify before starting):**
- [ ] All Phase 1 items above confirmed
- [ ] Design completion summary written (or "skipped, no UI scope")
- [ ] Design dual voices ran (if Phase 2 ran)
- [ ] Design consensus table produced (if Phase 2 ran)
- [ ] Phase-transition summary emitted

## Phase 3: Eng Review + Dual Voices

Follow plan-eng-review/SKILL.md — all sections, full depth.
Override: every AskUserQuestion → auto-decide using the 6 principles.

**Override rules:**
- Scope challenge: never reduce (P2)
- Codex review: always run if available (P6)
  Command: `codex exec "Review this plan for architectural issues, missing edge cases, and hidden complexity. Be adversarial. File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes, then proceed with "Codex timed out — single-reviewer mode"
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).

  **Codex eng voice** (via Bash):
  Command: `codex exec "Review this plan for architectural issues, missing edge cases,
  and hidden complexity. Be adversarial.

  Also consider these findings from prior review phases:
  CEO: <insert CEO consensus table summary — key concerns, DISAGREEs>
  Design: <insert Design consensus table summary, or 'skipped, no UI scope'>

  File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude eng subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent senior engineer
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Architecture: Is the component structure sound? Coupling concerns?
  2. Edge cases: What breaks under 10x load? What's the nil/empty/error path?
  3. Tests: What's missing from the test plan? What would break at 2am Friday?
  4. Security: New attack surface? Auth boundaries? Input validation?
  5. Hidden complexity: What looks simple but isn't?
  For each finding: what's wrong, severity, and the fix."
  NO prior-phase context — subagent must be truly independent.

  Error handling: same as Phase 1 (non-blocking, degradation matrix applies).

- Architecture choices: explicit over clever (P5). If codex disagrees with valid reason → TASTE DECISION.
- Evals: always include all relevant suites (P1)
- Test plan: generate artifact at `~/.gstack/projects/$SLUG/{user}-{branch}-test-plan-{datetime}.md`


@@ 584,7 748,26 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
1. Step 0 (Scope Challenge): Read actual code referenced by the plan. Map each
   sub-problem to existing code. Run the complexity check. Produce concrete findings.

2. Step 0.5 (Codex): Run if available. Present full output under CODEX SAYS header.
2. Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present
   Codex output under CODEX SAYS (eng — architecture challenge) header. Present subagent
   output under CLAUDE SUBAGENT (eng — independent review) header. Produce eng consensus
   table:

```
ENG DUAL VOICES — CONSENSUS TABLE:
═══════════════════════════════════════════════════════════════
  Dimension                           Claude  Codex  Consensus
  ──────────────────────────────────── ─────── ─────── ─────────
  1. Architecture sound?               —       —      —
  2. Test coverage sufficient?         —       —      —
  3. Performance risks addressed?      —       —      —
  4. Security threats covered?         —       —      —
  5. Error paths handled?              —       —      —
  6. Deployment risk manageable?       —       —      —
═══════════════════════════════════════════════════════════════
CONFIRMED = both agree. DISAGREE = models differ (→ taste decision).
Missing voice = N/A (not CONFIRMED). Single critical finding from one voice = flagged regardless.
```

3. Section 1 (Architecture): Produce ASCII dependency graph showing new components
   and their relationships to existing ones. Evaluate coupling, scaling, security.


@@ 648,10 831,14 @@ produced. Check the plan file and conversation for each item.
- [ ] "What already exists" section written
- [ ] Dream state delta written
- [ ] Completion Summary produced
- [ ] Dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] CEO consensus table produced

**Phase 2 (Design) outputs — only if UI scope detected:**
- [ ] All 7 dimensions evaluated with scores
- [ ] Issues identified and auto-decided
- [ ] Dual voices ran (or noted unavailable/skipped with phase)
- [ ] Design litmus scorecard produced

**Phase 3 (Eng) outputs:**
- [ ] Scope challenge with actual code analysis (not just "scope is fine")


@@ 662,6 849,11 @@ produced. Check the plan file and conversation for each item.
- [ ] "What already exists" section written
- [ ] Failure modes registry with critical gap assessment
- [ ] Completion Summary produced
- [ ] Dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] Eng consensus table produced

**Cross-phase:**
- [ ] Cross-phase themes section written

**Audit trail:**
- [ ] Decision Audit Trail has at least one row per auto-decision (not empty)


@@ 696,9 888,16 @@ I recommend [X] — [principle]. But [Y] is also viable:

### Review Scores
- CEO: [summary]
- CEO Voices: Codex [summary], Claude subagent [summary], Consensus [X/6 confirmed]
- Design: [summary or "skipped, no UI scope"]
- Design Voices: Codex [summary], Claude subagent [summary], Consensus [X/7 confirmed] (or "skipped")
- Eng: [summary]
- Codex: [summary or "unavailable"]
- Eng Voices: Codex [summary], Claude subagent [summary], Consensus [X/6 confirmed]

### Cross-Phase Themes
[For any concern that appeared in 2+ phases' dual voices independently:]
**Theme: [topic]** — flagged in [Phase 1, Phase 3]. High-confidence signal.
[If no themes span phases:] "No cross-phase themes — each phase's concerns were distinct."

### Deferred to TODOS.md
[Items auto-deferred with reasons]


@@ 745,6 944,21 @@ If Phase 2 ran (UI scope):

Replace field values with actual counts from the review.

Dual voice logs (one per phase that ran):
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"ceo","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'

~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"eng","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'
```

If Phase 2 ran (UI scope), also log:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"design","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'
```

SOURCE = "codex+subagent", "codex-only", "subagent-only", or "unavailable".
Replace N values with actual consensus counts from the tables.

Suggest next step: `/ship` when ready to create the PR.

---

M autoplan/SKILL.md.tmpl => autoplan/SKILL.md.tmpl +220 -6
@@ 72,6 72,17 @@ Examples: run codex (always yes), run evals (always yes), reduce scope on a comp

---

## Sequential Execution — MANDATORY

Phases MUST execute in strict order: CEO → Design → Eng.
Each phase MUST complete fully before the next begins.
NEVER run phases in parallel — each builds on the previous.

Between each phase, emit a phase-transition summary and verify that all required
outputs from the prior phase are written before starting the next.

---

## What "Auto-Decide" Means

Auto-decide replaces the USER'S judgment with the 6 principles. It does NOT replace


@@ 157,6 168,8 @@ Read each file using the Read tool:
- Review Readiness Dashboard
- Plan File Review Report
- Prerequisite Skill Offer (BENEFITS_FROM)
- Outside Voice — Independent Plan Challenge
- Design Outside Voices (parallel)

Follow ONLY the review-specific methodology, sections, and required outputs.



@@ 180,6 193,38 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
- Scope expansion: in blast radius + <1d CC → approve (P2). Outside → defer to TODOS.md (P3).
  Duplicates → reject (P4). Borderline (3-5 files) → mark TASTE DECISION.
- All 10 review sections: run fully, auto-decide each issue, log every decision.
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).
  Run them simultaneously (Agent tool for subagent, Bash for Codex).

  **Codex CEO voice** (via Bash):
  Command: `codex exec "You are a CEO/founder advisor reviewing a development plan.
  Challenge the strategic foundations: Are the premises valid or assumed? Is this the
  right problem to solve, or is there a reframing that would be 10x more impactful?
  What alternatives were dismissed too quickly? What competitive or market risks are
  unaddressed? What scope decisions will look foolish in 6 months? Be adversarial.
  No compliments. Just the strategic blind spots.
  File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude CEO subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent CEO/strategist
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Is this the right problem to solve? Could a reframing yield 10x impact?
  2. Are the premises stated or just assumed? Which ones could be wrong?
  3. What's the 6-month regret scenario — what will look foolish?
  4. What alternatives were dismissed without sufficient analysis?
  5. What's the competitive risk — could someone else solve this first/better?
  For each finding: what's wrong, severity (critical/high/medium), and the fix."

  **Error handling:** All non-blocking. Codex auth/timeout/empty → proceed with
  Claude subagent only, tagged `[single-model]`. If Claude subagent also fails →
  "Outside voices unavailable — continuing with primary review."

  **Degradation matrix:** Both fail → "single-reviewer mode". Codex only →
  tag `[codex-only]`. Subagent only → tag `[subagent-only]`.

- Strategy choices: if codex disagrees with a premise or scope decision with valid
  strategic reason → TASTE DECISION.

**Required execution checklist (CEO):**



@@ 192,6 237,27 @@ Step 0 (0A-0F) — run each sub-step and produce:
- 0E: Temporal interrogation (HOUR 1 → HOUR 6+)
- 0F: Mode selection confirmation

Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present
Codex output under CODEX SAYS (CEO — strategy challenge) header. Present subagent
output under CLAUDE SUBAGENT (CEO — strategic independence) header. Produce CEO
consensus table:

```
CEO DUAL VOICES — CONSENSUS TABLE:
═══════════════════════════════════════════════════════════════
  Dimension                           Claude  Codex  Consensus
  ──────────────────────────────────── ─────── ─────── ─────────
  1. Premises valid?                   —       —      —
  2. Right problem to solve?           —       —      —
  3. Scope calibration correct?        —       —      —
  4. Alternatives sufficiently explored?—      —      —
  5. Competitive/market risks covered? —       —      —
  6. 6-month trajectory sound?         —       —      —
═══════════════════════════════════════════════════════════════
CONFIRMED = both agree. DISAGREE = models differ (→ taste decision).
Missing voice = N/A (not CONFIRMED). Single critical finding from one voice = flagged regardless.
```

Sections 1-10 — for EACH section, run the evaluation criteria from the loaded skill file:
- Sections WITH findings: full analysis, auto-decide each issue, log to audit trail
- Sections with NO findings: 1-2 sentences stating what was examined and why nothing


@@ 206,8 272,23 @@ Sections 1-10 — for EACH section, run the evaluation criteria from the loaded 
- Dream state delta (where this plan leaves us vs 12-month ideal)
- Completion Summary (the full summary table from the CEO skill)

**PHASE 1 COMPLETE.** Emit phase-transition summary:
> **Phase 1 complete.** Codex: [N concerns]. Claude subagent: [N issues].
> Consensus: [X/6 confirmed, Y disagreements → surfaced at gate].
> Passing to Phase 2.

Do NOT begin Phase 2 until all Phase 1 outputs are written to the plan file
and the premise gate has been passed.

---

**Pre-Phase 2 checklist (verify before starting):**
- [ ] CEO completion summary written to plan file
- [ ] CEO dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] CEO consensus table produced
- [ ] Premise gate passed (user confirmed)
- [ ] Phase-transition summary emitted

## Phase 2: Design Review (conditional — skip if no UI scope)

Follow plan-design-review/SKILL.md — all 7 dimensions, full depth.


@@ 218,19 299,102 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
- Structural issues (missing states, broken hierarchy): auto-fix (P5)
- Aesthetic/taste issues: mark TASTE DECISION
- Design system alignment: auto-fix if DESIGN.md exists and fix is obvious
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).

  **Codex design voice** (via Bash):
  Command: `codex exec "Read the plan file at <plan_path>. Evaluate this plan's
  UI/UX design decisions.

  Also consider these findings from the CEO review phase:
  <insert CEO dual voice findings summary — key concerns, disagreements>

  Does the information hierarchy serve the user or the developer? Are interaction
  states (loading, empty, error, partial) specified or left to the implementer's
  imagination? Is the responsive strategy intentional or afterthought? Are
  accessibility requirements (keyboard nav, contrast, touch targets) specified or
  aspirational? Does the plan describe specific UI decisions or generic patterns?
  What design decisions will haunt the implementer if left ambiguous?
  Be opinionated. No hedging." -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude design subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent senior product designer
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Information hierarchy: what does the user see first, second, third? Is it right?
  2. Missing states: loading, empty, error, success, partial — which are unspecified?
  3. User journey: what's the emotional arc? Where does it break?
  4. Specificity: does the plan describe SPECIFIC UI or generic patterns?
  5. What design decisions will haunt the implementer if left ambiguous?
  For each finding: what's wrong, severity (critical/high/medium), and the fix."
  NO prior-phase context — subagent must be truly independent.

  Error handling: same as Phase 1 (non-blocking, degradation matrix applies).

- Design choices: if codex disagrees with a design decision with valid UX reasoning
  → TASTE DECISION.

**Required execution checklist (Design):**

1. Step 0 (Design Scope): Rate completeness 0-10. Check DESIGN.md. Map existing patterns.

2. Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present under
   CODEX SAYS (design — UX challenge) and CLAUDE SUBAGENT (design — independent review)
   headers. Produce design litmus scorecard (consensus table). Use the litmus scorecard
   format from plan-design-review. Include CEO phase findings in Codex prompt ONLY
   (not Claude subagent — stays independent).

3. Passes 1-7: Run each from loaded skill. Rate 0-10. Auto-decide each issue.
   DISAGREE items from scorecard → raised in the relevant pass with both perspectives.

**PHASE 2 COMPLETE.** Emit phase-transition summary:
> **Phase 2 complete.** Codex: [N concerns]. Claude subagent: [N issues].
> Consensus: [X/Y confirmed, Z disagreements → surfaced at gate].
> Passing to Phase 3.

Do NOT begin Phase 3 until all Phase 2 outputs (if run) are written to the plan file.

---

## Phase 3: Eng Review + Codex
**Pre-Phase 3 checklist (verify before starting):**
- [ ] All Phase 1 items above confirmed
- [ ] Design completion summary written (or "skipped, no UI scope")
- [ ] Design dual voices ran (if Phase 2 ran)
- [ ] Design consensus table produced (if Phase 2 ran)
- [ ] Phase-transition summary emitted

## Phase 3: Eng Review + Dual Voices

Follow plan-eng-review/SKILL.md — all sections, full depth.
Override: every AskUserQuestion → auto-decide using the 6 principles.

**Override rules:**
- Scope challenge: never reduce (P2)
- Codex review: always run if available (P6)
  Command: `codex exec "Review this plan for architectural issues, missing edge cases, and hidden complexity. Be adversarial. File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes, then proceed with "Codex timed out — single-reviewer mode"
- Dual voices: always run BOTH Claude subagent AND Codex if available (P6).

  **Codex eng voice** (via Bash):
  Command: `codex exec "Review this plan for architectural issues, missing edge cases,
  and hidden complexity. Be adversarial.

  Also consider these findings from prior review phases:
  CEO: <insert CEO consensus table summary — key concerns, DISAGREEs>
  Design: <insert Design consensus table summary, or 'skipped, no UI scope'>

  File: <plan_path>" -s read-only --enable web_search_cached`
  Timeout: 10 minutes

  **Claude eng subagent** (via Agent tool):
  "Read the plan file at <plan_path>. You are an independent senior engineer
  reviewing this plan. You have NOT seen any prior review. Evaluate:
  1. Architecture: Is the component structure sound? Coupling concerns?
  2. Edge cases: What breaks under 10x load? What's the nil/empty/error path?
  3. Tests: What's missing from the test plan? What would break at 2am Friday?
  4. Security: New attack surface? Auth boundaries? Input validation?
  5. Hidden complexity: What looks simple but isn't?
  For each finding: what's wrong, severity, and the fix."
  NO prior-phase context — subagent must be truly independent.

  Error handling: same as Phase 1 (non-blocking, degradation matrix applies).

- Architecture choices: explicit over clever (P5). If codex disagrees with valid reason → TASTE DECISION.
- Evals: always include all relevant suites (P1)
- Test plan: generate artifact at `~/.gstack/projects/$SLUG/{user}-{branch}-test-plan-{datetime}.md`


@@ 241,7 405,26 @@ Override: every AskUserQuestion → auto-decide using the 6 principles.
1. Step 0 (Scope Challenge): Read actual code referenced by the plan. Map each
   sub-problem to existing code. Run the complexity check. Produce concrete findings.

2. Step 0.5 (Codex): Run if available. Present full output under CODEX SAYS header.
2. Step 0.5 (Dual Voices): Run Claude subagent AND Codex simultaneously. Present
   Codex output under CODEX SAYS (eng — architecture challenge) header. Present subagent
   output under CLAUDE SUBAGENT (eng — independent review) header. Produce eng consensus
   table:

```
ENG DUAL VOICES — CONSENSUS TABLE:
═══════════════════════════════════════════════════════════════
  Dimension                           Claude  Codex  Consensus
  ──────────────────────────────────── ─────── ─────── ─────────
  1. Architecture sound?               —       —      —
  2. Test coverage sufficient?         —       —      —
  3. Performance risks addressed?      —       —      —
  4. Security threats covered?         —       —      —
  5. Error paths handled?              —       —      —
  6. Deployment risk manageable?       —       —      —
═══════════════════════════════════════════════════════════════
CONFIRMED = both agree. DISAGREE = models differ (→ taste decision).
Missing voice = N/A (not CONFIRMED). Single critical finding from one voice = flagged regardless.
```

3. Section 1 (Architecture): Produce ASCII dependency graph showing new components
   and their relationships to existing ones. Evaluate coupling, scaling, security.


@@ 305,10 488,14 @@ produced. Check the plan file and conversation for each item.
- [ ] "What already exists" section written
- [ ] Dream state delta written
- [ ] Completion Summary produced
- [ ] Dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] CEO consensus table produced

**Phase 2 (Design) outputs — only if UI scope detected:**
- [ ] All 7 dimensions evaluated with scores
- [ ] Issues identified and auto-decided
- [ ] Dual voices ran (or noted unavailable/skipped with phase)
- [ ] Design litmus scorecard produced

**Phase 3 (Eng) outputs:**
- [ ] Scope challenge with actual code analysis (not just "scope is fine")


@@ 319,6 506,11 @@ produced. Check the plan file and conversation for each item.
- [ ] "What already exists" section written
- [ ] Failure modes registry with critical gap assessment
- [ ] Completion Summary produced
- [ ] Dual voices ran (Codex + Claude subagent, or noted unavailable)
- [ ] Eng consensus table produced

**Cross-phase:**
- [ ] Cross-phase themes section written

**Audit trail:**
- [ ] Decision Audit Trail has at least one row per auto-decision (not empty)


@@ 353,9 545,16 @@ I recommend [X] — [principle]. But [Y] is also viable:

### Review Scores
- CEO: [summary]
- CEO Voices: Codex [summary], Claude subagent [summary], Consensus [X/6 confirmed]
- Design: [summary or "skipped, no UI scope"]
- Design Voices: Codex [summary], Claude subagent [summary], Consensus [X/7 confirmed] (or "skipped")
- Eng: [summary]
- Codex: [summary or "unavailable"]
- Eng Voices: Codex [summary], Claude subagent [summary], Consensus [X/6 confirmed]

### Cross-Phase Themes
[For any concern that appeared in 2+ phases' dual voices independently:]
**Theme: [topic]** — flagged in [Phase 1, Phase 3]. High-confidence signal.
[If no themes span phases:] "No cross-phase themes — each phase's concerns were distinct."

### Deferred to TODOS.md
[Items auto-deferred with reasons]


@@ 402,6 601,21 @@ If Phase 2 ran (UI scope):

Replace field values with actual counts from the review.

Dual voice logs (one per phase that ran):
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"ceo","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'

~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"eng","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'
```

If Phase 2 ran (UI scope), also log:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"autoplan-voices","timestamp":"'"$TIMESTAMP"'","status":"STATUS","source":"SOURCE","phase":"design","via":"autoplan","consensus_confirmed":N,"consensus_disagree":N,"commit":"'"$COMMIT"'"}'
```

SOURCE = "codex+subagent", "codex-only", "subagent-only", or "unavailable".
Replace N values with actual consensus counts from the tables.

Suggest next step: `/ship` when ready to create the PR.

---