~cytrogen/gstack

9811ed37bf05a6cfa85ce3a1b00d8db5edfa44ca — Garry Tan a month ago d7c732b
feat: default codex reviews in /ship and /review (v0.9.4.0) (#256)

* feat: default codex reviews in /ship and /review with xhigh reasoning

Codex code reviews are now opt-in-once-then-always-on via a one-time
adoption prompt. When enabled, both review + adversarial run automatically
on every /ship and /review — no more choosing between them.

Key changes:
- New {{CODEX_REVIEW_STEP}} resolver centralizes Codex review logic (DRY)
- Three-state config: enabled/not-set/disabled via gstack-config
- P1 findings default to "Investigate and fix" instead of "Ship anyway"
- All reasoning bumped to xhigh (review, adversarial, consult)
- Codex review step stripped from codex-host variants (no self-invocation)
- Ship "Never ask" rule updated to accurately list quality-gate stops
- Error handling for auth, timeout, empty response (all non-blocking)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update touchfiles test for plan-ceo-review-benefits dependency

The merge from main added plan-ceo-review-benefits to E2E_TOUCHFILES,
which means plan-ceo-review/SKILL.md now selects 3 tests, not 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: default codex reviews in /ship and /review (v0.9.4.0)

Codex code reviews now run automatically — both review + adversarial
challenge — with a one-time opt-in prompt for new users. All modes use
xhigh reasoning. Codex-host builds strip the step to prevent recursion.

Fixes from Codex review: TMPERR properly defined, stderr captured for
both review and adversarial, error handling before log persist, commit
hash included in review log for staleness tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
M .agents/skills/gstack-plan-ceo-review/SKILL.md => .agents/skills/gstack-plan-ceo-review/SKILL.md +1 -1
@@ 993,7 993,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M .agents/skills/gstack-plan-design-review/SKILL.md => .agents/skills/gstack-plan-design-review/SKILL.md +1 -1
@@ 528,7 528,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M .agents/skills/gstack-plan-eng-review/SKILL.md => .agents/skills/gstack-plan-eng-review/SKILL.md +1 -1
@@ 517,7 517,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M .agents/skills/gstack-review/SKILL.md => .agents/skills/gstack-review/SKILL.md +0 -47
@@ 474,54 474,7 @@ If no documentation files exist, skip this step silently.

---

## Step 5.7: Codex second opinion (optional)

After completing the review, check if the Codex CLI is available:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, use AskUserQuestion:

```
Review complete. Want an independent second opinion from Codex (OpenAI)?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to find ways this code will fail in production
C) Both — review first, then adversarial challenge
D) Skip — no Codex review needed
```

If the user chooses A, B, or C:

**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS (code review):` header.
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
After presenting, compare Codex's findings with your own review findings from Steps 4-5
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
and what only Claude found.

**For adversarial challenge (B or C):** Run:
```bash
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
```
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.

**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
```bash
~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
```

Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").

**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
there is no gate verdict to record, and a false entry would make the Review Readiness
Dashboard believe a code review happened when it didn't.

If Codex is not available, skip this step silently.

---

## Important Rules


M .agents/skills/gstack-ship/SKILL.md => .agents/skills/gstack-ship/SKILL.md +2 -38
@@ 295,7 295,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)


@@ 837,43 837,7 @@ For each classified comment:

---

## Step 3.8: Codex second opinion (optional)

Check if the Codex CLI is available:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, use AskUserQuestion:

```
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to break this code
C) Skip — ship without Codex review
```

If the user chooses A or B:

**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
to determine pass/fail gate. Persist the result:

```bash
~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
```

If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
If the user says no, stop. If yes, continue to Step 4.

**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
Present findings. This is informational — does not block shipping.

If Codex is not available, skip silently. Continue to Step 4.

---

## Step 4: Version bump (auto-decide)



@@ 1114,7 1078,7 @@ doc updates — the user runs `/ship` and documentation stays current without a 
- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.

M CHANGELOG.md => CHANGELOG.md +13 -0
@@ 1,5 1,18 @@
# Changelog

## [0.9.4.0] - 2026-03-20 — Codex Reviews On By Default

### Changed

- **Codex code reviews now run automatically in `/ship` and `/review`.** No more "want a second opinion?" prompt every time — Codex reviews both your code (with a pass/fail gate) and runs an adversarial challenge by default. First-time users get a one-time opt-in prompt; after that, it's hands-free. Configure with `gstack-config set codex_reviews enabled|disabled`.
- **All Codex operations use maximum reasoning power.** Review, adversarial, and consult modes all use `xhigh` reasoning effort — when an AI is reviewing your code, you want it thinking as hard as possible.
- **Codex review errors can't corrupt the dashboard.** Auth failures, timeouts, and empty responses are now detected before logging results, so the Review Readiness Dashboard never shows a false "passed" entry. Adversarial stderr is captured separately.
- **Codex review log includes commit hash.** Staleness detection now works correctly for Codex reviews, matching the same commit-tracking behavior as eng/CEO/design reviews.

### Fixed

- **Codex-for-Codex recursion prevented.** When gstack runs inside Codex CLI (`.agents/skills/`), the Codex review step is completely stripped — no accidental infinite loops.

## [0.9.3.0] - 2026-03-20 — Windows Support

### Fixed

M VERSION => VERSION +1 -1
@@ 1,1 1,1 @@
0.9.3.0
0.9.4.0

M codex/SKILL.md => codex/SKILL.md +5 -8
@@ 300,13 300,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)

2. Run the review (5-minute timeout):
```bash
codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

Use `timeout: 300000` on the Bash call. If the user provided custom instructions
(e.g., `/codex review focus on security`), pass them as the prompt argument:
```bash
codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

3. Capture the output. Then parse cost from stderr:


@@ 461,7 461,7 @@ THE PLAN:

For a **new session:**
```bash
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
import sys, json
for line in sys.stdin:
    line = line.strip()


@@ 494,7 494,7 @@ for line in sys.stdin:

For a **resumed session** (user chose "Continue"):
```bash
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
<same python streaming parser as above>
"
```


@@ 530,10 530,7 @@ Session saved — run /codex again to continue this conversation.
agentic coding model). This means as OpenAI ships newer models, /codex automatically
uses them. If the user wants a specific model, pass `-m` through to codex.

**Reasoning effort** varies by mode — use the right level for each task:
- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
- **Consult mode:** `high` — good balance of depth and speed for conversations.
**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.

**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.

M codex/SKILL.md.tmpl => codex/SKILL.md.tmpl +5 -8
@@ 79,13 79,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)

2. Run the review (5-minute timeout):
```bash
codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

Use `timeout: 300000` on the Bash call. If the user provided custom instructions
(e.g., `/codex review focus on security`), pass them as the prompt argument:
```bash
codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

3. Capture the output. Then parse cost from stderr:


@@ 240,7 240,7 @@ THE PLAN:

For a **new session:**
```bash
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
import sys, json
for line in sys.stdin:
    line = line.strip()


@@ 273,7 273,7 @@ for line in sys.stdin:

For a **resumed session** (user chose "Continue"):
```bash
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
<same python streaming parser as above>
"
```


@@ 309,10 309,7 @@ Session saved — run /codex again to continue this conversation.
agentic coding model). This means as OpenAI ships newer models, /codex automatically
uses them. If the user wants a specific model, pass `-m` through to codex.

**Reasoning effort** varies by mode — use the right level for each task:
- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
- **Consult mode:** `high` — good balance of depth and speed for conversations.
**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.

**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.

M plan-ceo-review/SKILL.md => plan-ceo-review/SKILL.md +1 -1
@@ 1001,7 1001,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M plan-design-review/SKILL.md => plan-design-review/SKILL.md +1 -1
@@ 536,7 536,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M plan-eng-review/SKILL.md => plan-eng-review/SKILL.md +1 -1
@@ 526,7 526,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with `gstack-config set skip_eng_review true` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with `gstack-config set codex_reviews enabled|disabled`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or `skip_eng_review` is `true`)

M review/SKILL.md => review/SKILL.md +100 -24
@@ 483,52 483,128 @@ If no documentation files exist, skip this step silently.

---

## Step 5.7: Codex second opinion (optional)
## Step 5.7: Codex review

After completing the review, check if the Codex CLI is available:
Check if the Codex CLI is available and read the user's Codex review preference:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
```

If Codex is available, use AskUserQuestion:
If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.

If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.

If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.

If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:

```
Review complete. Want an independent second opinion from Codex (OpenAI)?
GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to find ways this code will fail in production
C) Both — review first, then adversarial challenge
D) Skip — no Codex review needed
A) Enable for all future runs (recommended, default)
B) Try it for now, ask me again later
C) No thanks, don't ask me again
```

If the user chooses A, B, or C:
If the user chooses A: persist the setting and run both:
```bash
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
```

**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS (code review):` header.
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
After presenting, compare Codex's findings with your own review findings from Steps 4-5
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
and what only Claude found.
If the user chooses B: run both this time but do not persist any setting.

**For adversarial challenge (B or C):** Run:
If the user chooses C: persist the opt-out and skip:
```bash
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
```
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
Then skip this step. Continue to the next step.
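The three-state dispatch above can be sketched as a small helper (a sketch only: `codex_reviews_mode` is a hypothetical function name, not a gstack command; the `enabled`/`disabled`/not-set values are the ones documented here):

```shell
# Sketch: map the codex_reviews config value to an action.
# codex_reviews_mode is a hypothetical helper, not part of gstack.
codex_reviews_mode() {
  case "${1:-not_set}" in
    disabled) echo "skip" ;;            # opted out: skip the step silently
    enabled)  echo "run_both" ;;        # run review + adversarial with no prompt
    *)        echo "adoption_prompt" ;; # not set yet: one-time opt-in prompt
  esac
}

# An empty or missing value falls through to the adoption prompt.
codex_reviews_mode "$(gstack-config get codex_reviews 2>/dev/null || true)"
```

Treating every unrecognized value as "not set" keeps the one-time prompt as the safe default.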

### Run Codex

**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.

First, create a temp file for stderr capture:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
```

**Code review:** Run:
```bash
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```

After the command completes, read stderr for cost/error info:
```bash
cat "$TMPERR"
```

Present the full output verbatim under a `CODEX SAYS (code review):` header:

```
CODEX SAYS (code review):
════════════════════════════════════════════════════════════
<full codex output, verbatim — do not truncate or summarize>
════════════════════════════════════════════════════════════
GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
```

Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
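The gate check is a fixed-string search (a sketch; `CODEX_OUT` is a hypothetical variable standing in for the captured review stdout, and the sample findings are made up):

```shell
# Sketch: derive the gate from [P1] markers in the codex output.
# CODEX_OUT holds hypothetical sample output, not a real codex run.
CODEX_OUT='[P2] style: prefer early return
[P1] bug: unchecked null dereference'

# grep -F matches [P1] literally, so the brackets are not a character class.
if printf '%s\n' "$CODEX_OUT" | grep -qF '[P1]'; then
  GATE=fail
else
  GATE=pass
fi
echo "GATE: $GATE"   # prints "GATE: fail" for the sample above
```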

**If GATE is FAIL:** use AskUserQuestion:

```
Codex found N critical issues in the diff.

A) Investigate and fix now (recommended)
B) Ship anyway — these issues may cause production problems
```

If the user chooses A: read the Codex findings carefully and work to address them. Then re-run `codex review` to verify the gate is now PASS.

If the user chooses B: continue to the next step.

### Error handling (code review)

Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:

- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run `codex login` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.

**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
```

Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
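The guard-then-persist flow can be sketched as follows (the JSON fields match the entry above; the temp files, sample contents, and stderr patterns are illustrative, and the auth heuristic is an assumption, not a documented codex error format):

```shell
# Sketch: only build a review-log entry when codex produced real output
# and stderr shows no auth failure. File contents here are sample data.
OUT=$(mktemp)
ERR=$(mktemp)
printf '[P1] example finding\n' > "$OUT"
: > "$ERR"   # empty stderr: no auth error in this sample

if grep -qiE 'auth|login|unauthorized|api key' "$ERR"; then
  echo "skip persist: auth failure"
elif [ ! -s "$OUT" ]; then
  echo "skip persist: empty response"
else
  GATE=pass; STATUS=clean
  if grep -qF '[P1]' "$OUT"; then GATE=fail; STATUS=issues_found; fi
  ENTRY=$(printf '{"skill":"codex-review","timestamp":"%s","status":"%s","gate":"%s","commit":"%s"}' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$STATUS" "$GATE" \
    "$(git rev-parse --short HEAD 2>/dev/null || echo unknown)")
  echo "$ENTRY"   # this string would be passed to gstack-review-log
fi
rm -f "$OUT" "$ERR"
```

Building the JSON only on the success path is what keeps error runs from ever reaching the review log.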

**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
there is no gate verdict to record, and a false entry would make the Review Readiness
Dashboard believe a code review happened when it didn't.
**Adversarial challenge:** Run:
```bash
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
```

After the command completes, read adversarial stderr:
```bash
cat "$TMPERR_ADV"
```

Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.

**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:

```
CROSS-MODEL ANALYSIS:
  Both found: [findings that overlap between Claude and Codex]
  Only Codex found: [findings unique to Codex]
  Only Claude found: [findings unique to Claude's review]
  Agreement rate: X% (N/M total unique findings overlap)
```

If Codex is not available, skip this step silently.
**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.

---


M review/SKILL.md.tmpl => review/SKILL.md.tmpl +1 -48
@@ 231,54 231,7 @@ If no documentation files exist, skip this step silently.

---

## Step 5.7: Codex second opinion (optional)

After completing the review, check if the Codex CLI is available:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, use AskUserQuestion:

```
Review complete. Want an independent second opinion from Codex (OpenAI)?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to find ways this code will fail in production
C) Both — review first, then adversarial challenge
D) Skip — no Codex review needed
```

If the user chooses A, B, or C:

**For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS (code review):` header.
Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
After presenting, compare Codex's findings with your own review findings from Steps 4-5
and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
and what only Claude found.

**For adversarial challenge (B or C):** Run:
```bash
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
```
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.

**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
```

Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").

**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
there is no gate verdict to record, and a false entry would make the Review Readiness
Dashboard believe a code review happened when it didn't.

If Codex is not available, skip this step silently.

---
{{CODEX_REVIEW_STEP}}

## Important Rules


M scripts/gen-skill-docs.ts => scripts/gen-skill-docs.ts +135 -1
@@ 1092,7 1092,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \\\`gstack-config set codex_reviews enabled|disabled\\\`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)


@@ 1412,6 1412,139 @@ The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstrea
(\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`;
}

function generateCodexReviewStep(ctx: TemplateContext): string {
  // Codex host: strip entirely — Codex should never invoke itself
  if (ctx.host === 'codex') return '';

  const isShip = ctx.skillName === 'ship';
  const stepNum = isShip ? '3.8' : '5.7';

  return `## Step ${stepNum}: Codex review

Check if the Codex CLI is available and read the user's Codex review preference:

\`\`\`bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
echo "CODEX_REVIEWS: \${CODEX_REVIEWS_CFG:-not_set}"
\`\`\`

If \`CODEX_NOT_AVAILABLE\`: skip this step silently. Continue to the next step.

If \`CODEX_REVIEWS\` is \`disabled\`: skip this step silently. Continue to the next step.

If \`CODEX_REVIEWS\` is \`enabled\`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.

If \`CODEX_REVIEWS\` is \`not_set\`: use AskUserQuestion to offer the one-time adoption prompt:

\`\`\`
GStack recommends enabling Codex code reviews — Codex is the super-smart, quiet engineer friend who will save your butt.

A) Enable for all future runs (recommended, default)
B) Try it for now, ask me again later
C) No thanks, don't ask me again
\`\`\`

If the user chooses A: persist the setting and run both:
\`\`\`bash
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
\`\`\`

If the user chooses B: run both this time but do not persist any setting.

If the user chooses C: persist the opt-out and skip:
\`\`\`bash
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
\`\`\`
Then skip this step. Continue to the next step.

### Run Codex

Always run **both** code review and adversarial challenge. Use a 5-minute timeout (\`timeout: 300000\`) on each Bash call.

First, create a temp file for stderr capture:
\`\`\`bash
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
\`\`\`

**Code review:** Run:
\`\`\`bash
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
\`\`\`

After the command completes, read stderr for cost/error info:
\`\`\`bash
cat "$TMPERR"
\`\`\`

Present the full output verbatim under a \`CODEX SAYS (code review):\` header:

\`\`\`
CODEX SAYS (code review):
════════════════════════════════════════════════════════════
<full codex output, verbatim — do not truncate or summarize>
════════════════════════════════════════════════════════════
GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
\`\`\`

Check the output for \`[P1]\` markers. If found: \`GATE: FAIL\`. If no \`[P1]\`: \`GATE: PASS\`.

**If GATE is FAIL:** use AskUserQuestion:

\`\`\`
Codex found N critical issues in the diff.

A) Investigate and fix now (recommended)
B) Ship anyway — these issues may cause production problems
\`\`\`

If the user chooses A: read the Codex findings carefully and work to address them${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Then re-run \`codex review\` to verify the gate is now PASS.

If the user chooses B: continue to the next step.

### Error handling (code review)

Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check \`$TMPERR\` output (already read above) for error indicators:

- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \\\`codex login\\\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.

**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
\`\`\`bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
\`\`\`

Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").

**Adversarial challenge:** Run:
\`\`\`bash
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
\`\`\`

After the command completes, read adversarial stderr:
\`\`\`bash
cat "$TMPERR_ADV"
\`\`\`

Present the full output verbatim under a \`CODEX SAYS (adversarial challenge):\` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
${!isShip ? `
**Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:

\`\`\`
CROSS-MODEL ANALYSIS:
  Both found: [findings that overlap between Claude and Codex]
  Only Codex found: [findings unique to Codex]
  Only Claude found: [findings unique to Claude's review]
  Agreement rate: X% (N/M total unique findings overlap)
\`\`\`
` : ''}
**Cleanup:** Run \`rm -f "$TMPERR" "$TMPERR_ADV"\` after processing.

---`;
}

const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
  COMMAND_REFERENCE: generateCommandReference,
  SNAPSHOT_FLAGS: generateSnapshotFlags,


@@ 1426,6 1559,7 @@ const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
  SPEC_REVIEW_LOOP: generateSpecReviewLoop,
  DESIGN_SKETCH: generateDesignSketch,
  BENEFITS_FROM: generateBenefitsFrom,
  CODEX_REVIEW_STEP: generateCodexReviewStep,
};

// ─── Codex Helpers ───────────────────────────────────────────

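The `CODEX_REVIEW_STEP` entry registered in `RESOLVERS` above plugs into a straightforward `{{PLACEHOLDER}}` substitution. A minimal sketch under assumed names — `render` and the simplified `Ctx` type are illustrative, not the real generator API:

```typescript
// Hypothetical sketch of {{PLACEHOLDER}} resolution against a resolver map.
type Ctx = { host: string; skillName: string };

const resolvers: Record<string, (ctx: Ctx) => string> = {
  // Codex host: strip entirely — Codex should never invoke itself.
  CODEX_REVIEW_STEP: (ctx) => (ctx.host === 'codex' ? '' : '## Step 3.8: Codex review'),
};

function render(template: string, ctx: Ctx): string {
  // Unknown placeholders are left intact so a freshness check can flag them.
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in resolvers ? resolvers[name](ctx) : match,
  );
}

console.log(render('{{CODEX_REVIEW_STEP}}', { host: 'codex', skillName: 'ship' }));  // ""
console.log(render('{{CODEX_REVIEW_STEP}}', { host: 'claude', skillName: 'ship' })); // "## Step 3.8: Codex review"
```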
M ship/SKILL.md => ship/SKILL.md +96 -19
@@ 305,7 305,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
- **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
- **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
- **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.

**Verdict logic:**
- **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)


@@ 847,41 847,118 @@ For each classified comment:

---

## Step 3.8: Codex second opinion (optional)
## Step 3.8: Codex review

Check if the Codex CLI is available:
Check if the Codex CLI is available and read the user's Codex review preference:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
```

If Codex is available, use AskUserQuestion:
If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.

If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.

If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.

If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:

```
GStack recommends enabling Codex code reviews — Codex is the super-smart, quiet engineer friend who will save your butt.

A) Enable for all future runs (recommended, default)
B) Try it for now, ask me again later
C) No thanks, don't ask me again
```

If the user chooses A: persist the setting and run both:
```bash
~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
```

If the user chooses B: run both this time but do not persist any setting.

If the user chooses C: persist the opt-out and skip:
```bash
~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
```
Then skip this step. Continue to the next step.
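The three-way branch above, plus the availability check, reduces to a small decision table. A hedged TypeScript sketch — the action labels are hypothetical, not part of the skill:

```typescript
// Hypothetical sketch: map the codex_reviews config value to an action.
type CodexAction = 'skip' | 'run_both' | 'prompt_once';

function decideCodexAction(codexAvailable: boolean, cfg: string | undefined): CodexAction {
  if (!codexAvailable) return 'skip';     // CODEX_NOT_AVAILABLE: skip silently
  switch (cfg) {
    case 'enabled':  return 'run_both';    // no prompt — review + adversarial
    case 'disabled': return 'skip';        // user opted out
    default:         return 'prompt_once'; // not_set — one-time adoption prompt
  }
}

console.log(decideCodexAction(true, undefined)); // "prompt_once"
```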

### Run Codex

Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.

First, create a temp file for stderr capture:
```bash
TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
```

**Code review:** Run:
```bash
codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
```
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to break this code
C) Skip — ship without Codex review
After the command completes, read stderr for cost/error info:
```bash
cat "$TMPERR"
```

Present the full output verbatim under a `CODEX SAYS (code review):` header:

```
CODEX SAYS (code review):
════════════════════════════════════════════════════════════
<full codex output, verbatim — do not truncate or summarize>
════════════════════════════════════════════════════════════
GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
```

Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
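The gate check is a plain substring scan over the captured stdout; a minimal TypeScript sketch:

```typescript
// Hypothetical sketch: derive the gate verdict from captured codex stdout.
function codexGate(stdout: string): 'PASS' | 'FAIL' {
  return stdout.includes('[P1]') ? 'FAIL' : 'PASS';
}

console.log(codexGate('[P2] minor: rename variable'));        // "PASS"
console.log(codexGate('[P1] race condition in cache write')); // "FAIL"
```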

**If GATE is FAIL:** use AskUserQuestion:

```
Codex found N critical issues in the diff.

A) Investigate and fix now (recommended)
B) Ship anyway — these issues may cause production problems
```

If the user chooses A or B:
If the user chooses A: read the Codex findings carefully and work to address them. After fixing, re-run tests (Step 3) since code has changed. Then re-run `codex review` to verify the gate is now PASS.

If the user chooses B: continue to the next step.

### Error handling (code review)

Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:

- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
- **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
- **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
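The three checks above can be sketched as a single classifier over the captured streams. The outcome labels here are hypothetical; the keyword list is taken from the auth bullet:

```typescript
// Hypothetical sketch: classify a codex run before persisting the gate result.
type CodexOutcome = 'ok' | 'auth_failure' | 'timed_out' | 'empty_response';

function classifyCodexRun(stdout: string, stderr: string, timedOut: boolean): CodexOutcome {
  const authHints = ['auth', 'login', 'unauthorized', 'api key'];
  if (authHints.some((h) => stderr.toLowerCase().includes(h))) return 'auth_failure';
  if (timedOut) return 'timed_out';
  if (stdout.trim() === '') return 'empty_response';
  return 'ok'; // only this outcome leads to a gstack-review-log entry
}
```

Only `ok` leads to the `gstack-review-log` persist; every other outcome skips logging, matching the "Do NOT persist" rule in each bullet.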

**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
to determine pass/fail gate. Persist the result:
**Only if codex produced a real review (non-empty stdout):** Persist the code review result:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
```

Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").

**Adversarial challenge:** Run:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
```

If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
If the user says no, stop. If yes, continue to Step 4.
After the command completes, read adversarial stderr:
```bash
cat "$TMPERR_ADV"
```

**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
Present findings. This is informational — does not block shipping.
Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.

If Codex is not available, skip silently. Continue to Step 4.
**Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.

---



@@ 1124,7 1201,7 @@ doc updates — the user runs `/ship` and documentation stays current without a 
- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.

M ship/SKILL.md.tmpl => ship/SKILL.md.tmpl +2 -38
@@ 403,43 403,7 @@ For each classified comment:

---

## Step 3.8: Codex second opinion (optional)

Check if the Codex CLI is available:

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, use AskUserQuestion:

```
Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?

A) Run Codex code review — independent diff review with pass/fail gate
B) Run Codex adversarial challenge — try to break this code
C) Skip — ship without Codex review
```

If the user chooses A or B:

**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
to determine pass/fail gate. Persist the result:

```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
```

If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
If the user says no, stop. If yes, continue to Step 4.

**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
Present findings. This is informational — does not block shipping.

If Codex is not available, skip silently. Continue to Step 4.

---
{{CODEX_REVIEW_STEP}}

## Step 4: Version bump (auto-decide)



@@ 680,7 644,7 @@ doc updates — the user runs `/ship` and documentation stays current without a 
- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.

M test/gen-skill-docs.test.ts => test/gen-skill-docs.test.ts +10 -0
@@ 584,6 584,16 @@ describe('Codex generation (--host codex)', () => {
    expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-codex'))).toBe(false);
  });

  test('Codex review step stripped from Codex-host ship and review', () => {
    const shipContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-ship', 'SKILL.md'), 'utf-8');
    expect(shipContent).not.toContain('codex review --base');
    expect(shipContent).not.toContain('Investigate and fix');

    const reviewContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
    expect(reviewContent).not.toContain('codex review --base');
    expect(reviewContent).not.toContain('Investigate and fix');
  });

  test('--host codex --dry-run freshness', () => {
    const result = Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex', '--dry-run'], {
      cwd: ROOT,

M test/skill-validation.test.ts => test/skill-validation.test.ts +22 -4
@@ 1256,18 1256,36 @@ describe('Codex skill', () => {
    expect(content).toContain('mktemp');
  });

  test('codex integration in /review offers second opinion', () => {
  test('codex integration in /review has config-driven review step', () => {
    const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
    expect(content).toContain('Codex second opinion');
    expect(content).toContain('Codex review');
    expect(content).toContain('codex_reviews');
    expect(content).toContain('codex review');
    expect(content).toContain('adversarial');
    expect(content).toContain('xhigh');
    expect(content).toContain('Investigate and fix');
    expect(content).toContain('CROSS-MODEL');
  });

  test('codex integration in /ship offers review gate', () => {
  test('codex integration in /ship has config-driven review step', () => {
    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
    expect(content).toContain('Codex');
    expect(content).toContain('Codex review');
    expect(content).toContain('codex_reviews');
    expect(content).toContain('codex review');
    expect(content).toContain('codex-review');
    expect(content).toContain('xhigh');
    expect(content).toContain('Investigate and fix');
  });

  test('codex-host ship/review do NOT contain codex review step', () => {
    const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
    expect(shipContent).not.toContain('codex review --base');
    expect(shipContent).not.toContain('Investigate and fix');

    const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
    expect(reviewContent).not.toContain('codex review --base');
    expect(reviewContent).not.toContain('codex_reviews');
    expect(reviewContent).not.toContain('Investigate and fix');
  });

  test('codex integration in /plan-eng-review offers plan critique', () => {

M test/touchfiles.test.ts => test/touchfiles.test.ts +3 -2
@@ 78,8 78,9 @@ describe('selectTests', () => {
    const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
    expect(result.selected).toContain('plan-ceo-review');
    expect(result.selected).toContain('plan-ceo-review-selective');
    expect(result.selected.length).toBe(2);
    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 2);
    expect(result.selected).toContain('plan-ceo-review-benefits');
    expect(result.selected.length).toBe(3);
    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 3);
  });
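The assertion change above follows from touchfile fan-out: three tests now register `plan-ceo-review/SKILL.md`. A hedged sketch of the selection rule, assuming exact-path matching — the real `selectTests` may match differently (e.g. via globs):

```typescript
// Hypothetical sketch of touchfile-driven test selection: a test is selected
// when any changed file appears in its registered touchfile list.
function selectTests(
  changed: string[],
  touchfiles: Record<string, string[]>,
): { selected: string[]; skipped: string[] } {
  const selected: string[] = [];
  const skipped: string[] = [];
  for (const [test, files] of Object.entries(touchfiles)) {
    (files.some((f) => changed.includes(f)) ? selected : skipped).push(test);
  }
  return { selected, skipped };
}

// Illustrative registry — not the real E2E_TOUCHFILES.
const E2E = {
  'plan-ceo-review': ['plan-ceo-review/SKILL.md'],
  'plan-ceo-review-selective': ['plan-ceo-review/SKILL.md'],
  'plan-ceo-review-benefits': ['plan-ceo-review/SKILL.md'],
  'ship': ['ship/SKILL.md'],
};
console.log(selectTests(['plan-ceo-review/SKILL.md'], E2E).selected.length); // 3
```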

  test('global touchfile triggers ALL tests', () => {