~cytrogen/gstack: docs: v0.9.8.0 — deploy pipeline docs + pre-merge readiness gate (#306)

8 files changed, 577 insertions(+), 24 deletions(-)

M .agents/skills/gstack-land-and-deploy/SKILL.md
M CHANGELOG.md
M README.md
M TODOS.md
M VERSION
M land-and-deploy/SKILL.md
M land-and-deploy/SKILL.md.tmpl
M package.json

M .agents/skills/gstack-land-and-deploy/SKILL.md => .agents/skills/gstack-land-and-deploy/SKILL.md +177 -4

@@ 286,11 286,14 @@ When the user types `/land-and-deploy`, run this skill.
 - `/land-and-deploy #123` — specific PR number
 - `/land-and-deploy #123 <url>` — specific PR + verification URL
 
-## Non-interactive philosophy (like /ship)
+## Non-interactive philosophy (like /ship) — with one critical gate
 
-This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step except the ones listed below. The user said `/land-and-deploy` which means DO IT.
+This is a **mostly automated** workflow. Do NOT ask for confirmation at any step except
+the ones listed below. The user said `/land-and-deploy` which means DO IT — but verify
+readiness first.
 
-**Only stop for:**
+**Always stop for:**
+- **Pre-merge readiness gate (Step 3.5)** — this is the ONE confirmation before merge
 - GitHub CLI not authenticated
 - No PR found for this branch
 - CI failures or merge conflicts


@@ 300,7 303,6 @@ This is a **non-interactive, fully automated** workflow. Do NOT ask for confirma
 
 **Never stop for:**
 - Choosing merge method (auto-detect from repo settings)
-- Confirming the merge
 - Timeout warnings (warn and continue gracefully)
 
 ---


@@ 365,6 367,177 @@ If timeout (15 min): **STOP.** "CI has been running for 15 minutes. Investigate 
 
 ---
 
+## Step 3.5: Pre-merge readiness gate
+
+**This is the critical safety check before an irreversible merge.** The merge cannot
+be undone without a revert commit. Gather ALL evidence, build a readiness report,
+and get explicit user confirmation before proceeding.
+
+Collect evidence for each check below. Track warnings (yellow) and blockers (red).
+
+### 3.5a: Review staleness check
+
+```bash
+~/.codex/skills/gstack/bin/gstack-review-read 2>/dev/null
+```
+
+Parse the output. For each review skill (plan-eng-review, plan-ceo-review,
+plan-design-review, design-review-lite, codex-review):
+
+1. Find the most recent entry within the last 7 days.
+2. Extract its `commit` field.
+3. Compare against current HEAD: `git rev-list --count STORED_COMMIT..HEAD`
+
+**Staleness rules:**
+- 0 commits since review → CURRENT
+- 1-3 commits since review → RECENT (yellow if those commits touch code, not just docs)
+- 4+ commits since review → STALE (red — review may not reflect current code)
+- No review found → NOT RUN
+
+**Critical check:** Look at what changed AFTER the last review. Run:
+```bash
+git log --oneline STORED_COMMIT..HEAD
+```
+If any commits after the review contain words like "fix", "refactor", "rewrite",
+"overhaul", or touch more than 5 files — flag as **STALE (significant changes
+since review)**. The review was done on different code than what's about to merge.
+
+### 3.5b: Test results
+
+**Free tests — run them now:**
+
+Read CLAUDE.md to find the project's test command. If not specified, use `bun test`.
+Run the test command and capture the exit code and output.
+
+```bash
+bun test 2>&1 | tail -10
+```
+
+If tests fail: **BLOCKER.** Cannot merge with failing tests.
+
+**E2E tests — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-e2e-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -20
+```
+
+For each eval file from today, parse pass/fail counts. Show:
+- Total tests, pass count, fail count
+- How long ago the run finished (from file timestamp)
+- Total cost
+- Names of any failing tests
+
+If no E2E results from today: **WARNING — no E2E tests run today.**
+If E2E results exist but have failures: **WARNING — N tests failed.** List them.
+
+**LLM judge evals — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-llm-judge-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -5
+```
+
+If found, parse and show pass/fail. If not found, note "No LLM evals run today."
+
+### 3.5c: PR body accuracy check
+
+Read the current PR body:
+```bash
+gh pr view --json body -q .body
+```
+
+Read the current diff summary:
+```bash
+git log --oneline $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -20
+```
+
+Compare the PR body against the actual commits. Check for:
+1. **Missing features** — commits that add significant functionality not mentioned in the PR
+2. **Stale descriptions** — PR body mentions things that were later changed or reverted
+3. **Wrong version** — PR title or body references a version that doesn't match VERSION file
+
+If the PR body looks stale or incomplete: **WARNING — PR body may not reflect current
+changes.** List what's missing or stale.
+
+### 3.5d: Document-release check
+
+Check if documentation was updated on this branch:
+
+```bash
+git log --oneline --all-match --grep="docs:" $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -5
+```
+
+Also check if key doc files were modified:
+```bash
+git diff --name-only $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)...HEAD -- README.md CHANGELOG.md ARCHITECTURE.md CONTRIBUTING.md CLAUDE.md VERSION
+```
+
+If CHANGELOG.md and VERSION were NOT modified on this branch and the diff includes
+new features (new files, new commands, new skills): **WARNING — /document-release
+likely not run. CHANGELOG and VERSION not updated despite new features.**
+
+If only docs changed (no code): skip this check.
+
+### 3.5e: Readiness report and confirmation
+
+Build the full readiness report:
+
+```
+╔══════════════════════════════════════════════════════════╗
+║              PRE-MERGE READINESS REPORT                  ║
+╠══════════════════════════════════════════════════════════╣
+║                                                          ║
+║  PR: #NNN — title                                        ║
+║  Branch: feature → main                                  ║
+║                                                          ║
+║  REVIEWS                                                 ║
+║  ├─ Eng Review:    CURRENT / STALE (N commits) / —       ║
+║  ├─ CEO Review:    CURRENT / — (optional)                ║
+║  ├─ Design Review: CURRENT / — (optional)                ║
+║  └─ Codex Review:  CURRENT / — (optional)                ║
+║                                                          ║
+║  TESTS                                                   ║
+║  ├─ Free tests:    PASS / FAIL (blocker)                 ║
+║  ├─ E2E tests:     52/52 pass (25 min ago) / NOT RUN     ║
+║  └─ LLM evals:     PASS / NOT RUN                        ║
+║                                                          ║
+║  DOCUMENTATION                                           ║
+║  ├─ CHANGELOG:     Updated / NOT UPDATED (warning)       ║
+║  ├─ VERSION:       0.9.8.0 / NOT BUMPED (warning)        ║
+║  └─ Doc release:   Run / NOT RUN (warning)               ║
+║                                                          ║
+║  PR BODY                                                 ║
+║  └─ Accuracy:      Current / STALE (warning)             ║
+║                                                          ║
+║  WARNINGS: N  |  BLOCKERS: N                             ║
+╚══════════════════════════════════════════════════════════╝
+```
+
+If there are BLOCKERS (failing free tests): list them and recommend B.
+If there are WARNINGS but no blockers: list each warning and recommend A if
+warnings are minor, or B if warnings are significant.
+If everything is green: recommend A.
+
+Use AskUserQuestion:
+
+- **Re-ground:** "About to merge PR #NNN (title) from branch X to Y. Here's the
+  readiness report." Show the report above.
+- List each warning and blocker explicitly.
+- **RECOMMENDATION:** Choose A if green. Choose B if there are significant warnings.
+  Choose C only if the user understands the risks.
+- A) Merge — readiness checks passed (Completeness: 10/10)
+- B) Don't merge yet — address the warnings first (Completeness: 10/10)
+- C) Merge anyway — I understand the risks (Completeness: 3/10)
+
+If the user chooses B: **STOP.** List exactly what needs to be done:
+- If reviews are stale: "Re-run /plan-eng-review (or /review) to review current code."
+- If E2E not run: "Run `bun run test:e2e` to verify."
+- If docs not updated: "Run /document-release to update documentation."
+- If PR body stale: "Update the PR body to reflect current changes."
+
+If the user chooses A or C: continue to Step 4.
+
+---
+
 ## Step 4: Merge the PR
 
 Record the start timestamp for timing data.

M CHANGELOG.md => CHANGELOG.md +23 -0

@@ 1,5 1,28 @@
 # Changelog
 
+## [0.9.8.0] - 2026-03-21 — Deploy Pipeline + E2E Performance
+
+### Added
+
+- **`/land-and-deploy` — merge, deploy, and verify in one command.** Takes over where `/ship` left off. Merges the PR, waits for CI and deploy workflows, then runs canary verification on your production URL. Auto-detects your deploy platform (Fly.io, Render, Vercel, Netlify, Heroku, GitHub Actions). Offers revert at every failure point. One command from "PR approved" to "verified in production."
+- **`/canary` — post-deploy monitoring loop.** Watches your live app for console errors, performance regressions, and page failures using the browse daemon. Takes periodic screenshots, compares against pre-deploy baselines, and alerts on anomalies. Run `/canary https://myapp.com --duration 10m` after any deploy.
+- **`/benchmark` — performance regression detection.** Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Catches the bundle size regressions that code review misses.
+- **`/setup-deploy` — one-time deploy configuration.** Detects your deploy platform, production URL, health check endpoints, and deploy status commands. Writes the config to CLAUDE.md so all future `/land-and-deploy` runs are fully automatic.
+- **`/review` now includes Performance & Bundle Impact analysis.** The informational review pass checks for heavy dependencies, missing lazy loading, synchronous script tags, and bundle size regressions. Catches moment.js-instead-of-date-fns before it ships.
+
+### Changed
+
+- **E2E tests now run 3-5x faster.** Structure tests default to Sonnet (5x faster, 5x cheaper). Quality tests (planted-bug detection, design quality, strategic review) stay on Opus. Full suite dropped from 50-80 minutes to ~15-25 minutes.
+- **`--retry 2` on all E2E tests.** Flaky tests get a second chance without masking real failures.
+- **`test:e2e:fast` tier.** Excludes the 8 slowest Opus quality tests for quick feedback (~5-7 minutes). Run `bun run test:e2e:fast` for rapid iteration.
+- **E2E timing telemetry.** Every test now records `first_response_ms`, `max_inter_turn_ms`, and `model` used. Wall-clock timing shows whether parallelism is actually working.
+
+### Fixed
+
+- **`plan-design-review-plan-mode` no longer races.** Each test gets its own isolated tmpdir — no more concurrent tests polluting each other's working directory.
+- **`ship-local-workflow` no longer wastes 6 of 15 turns.** Ship workflow steps are inlined in the test prompt instead of having the agent read the 700+ line SKILL.md at runtime.
+- **`design-consultation-core` no longer fails on synonym sections.** "Colors" matches "Color", "Type System" matches "Typography" — fuzzy synonym-based matching with all 7 sections still required.
+
 ## [0.9.7.0] - 2026-03-21 — Plan File Review Report
 
 ### Added

M README.md => README.md +11 -5

@@ 16,7 16,7 @@ In the last 60 days I have written **over 600,000 lines of production code** —
 
 Same person. Different era. The difference is the tooling.
 
-**gstack is how I do it.** It is my open source software factory. It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Fifteen specialists and six power tools, all as slash commands, all Markdown, **all free, MIT license, available right now.**
+**gstack is how I do it.** It is my open source software factory. It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Eighteen specialists and seven power tools, all as slash commands, all Markdown, **all free, MIT license, available right now.**
 
 I am learning how to get to the edge of what agentic systems can do as of March 2026, and this is my live experiment. I am sharing it because I want the whole world on this journey with me.
 


@@ 48,11 48,11 @@ Expect first useful run in under 5 minutes on any repo with tests already set up
 
 Open Claude Code and paste this. Claude does the rest.
 
-> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade. Then ask the user if they also want to add gstack to the current project so teammates get it.
+> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade. Then ask the user if they also want to add gstack to the current project so teammates get it.
 
 ### Step 2: Add to your repo so teammates get it (optional)
 
-> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
+> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
 
 Real files get committed to your repo (not a submodule), so `git clone` just works. Everything lives inside `.claude/`. Nothing touches your PATH or runs in the background.
 


@@ 72,7 72,7 @@ git clone https://github.com/garrytan/gstack.git ~/gstack
 cd ~/gstack && ./setup --host auto
 ```
 
-This installs to `~/.claude/skills/gstack` and/or `~/.codex/skills/gstack` depending on what's available. All 21 skills work across all supported agents. Hook-based safety skills (careful, freeze, guard) use inline safety advisory prose on non-Claude hosts.
+This installs to `~/.claude/skills/gstack` and/or `~/.codex/skills/gstack` depending on what's available. All 25 skills work across all supported agents. Hook-based safety skills (careful, freeze, guard) use inline safety advisory prose on non-Claude hosts.
 
 ## See it work
 


@@ 140,6 140,9 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.
 | `/qa` | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. |
 | `/qa-only` | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. |
 | `/ship` | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. |
+| `/land-and-deploy` | **Release Engineer** | Merge the PR, wait for CI and deploy, verify production health. Takes over after `/ship`. One command from "approved" to "verified in production." |
+| `/canary` | **SRE** | Post-deploy monitoring loop. Watches for console errors, performance regressions, and page failures. Periodic screenshots and anomaly detection. |
+| `/benchmark` | **Performance Engineer** | Baseline page load times, Core Web Vitals, and resource sizes. Compare before/after on every PR. Catch bundle size regressions before they ship. |
 | `/document-release` | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
 | `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. |
 | `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. |


@@ 154,6 157,7 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.
 | `/freeze` | **Edit Lock** — restrict file edits to one directory. Prevents accidental changes outside scope while debugging. |
 | `/guard` | **Full Safety** — `/careful` + `/freeze` in one command. Maximum safety for prod work. |
 | `/unfreeze` | **Unlock** — remove the `/freeze` boundary. |
+| `/setup-deploy` | **Deploy Configurator** — one-time setup for `/land-and-deploy`. Detects your platform, production URL, and deploy commands. |
 | `/gstack-upgrade` | **Self-Updater** — upgrade gstack to latest. Detects global vs vendored install, syncs both, shows what changed. |
 
 **[Deep dives with examples and philosophy for every skill →](docs/skills.md)**


@@ 170,6 174,8 @@ One sprint, one person, one feature — that takes about 30 minutes with gstack.
 
 **Test everything.** `/ship` bootstraps test frameworks from scratch if your project doesn't have one. Every `/ship` run produces a coverage audit. Every `/qa` bug fix generates a regression test. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding.
 
+**Ship to production in one command.** `/land-and-deploy` picks up where `/ship` left off — merges your PR, waits for CI and deploy, then runs canary verification on your production URL. Auto-detects Fly.io, Render, Vercel, Netlify, Heroku, or GitHub Actions. If something breaks, it offers a revert. Pair with `/canary` for extended post-deploy monitoring and `/benchmark` to catch performance regressions before they ship.
+
 **`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically. And now `/ship` auto-invokes it — docs stay current without an extra command.
 
 **Browser handoff when the AI gets stuck.** Hit a CAPTCHA, auth wall, or MFA prompt? `$B handoff` opens a visible Chrome at the exact same page with all your cookies and tabs intact. Solve the problem, tell Claude you're done, `$B resume` picks up right where it left off. The agent even suggests it automatically after 3 consecutive failures.


@@ 200,7 206,7 @@ Same tools, different outcome — because gstack gives you structured roles and 
 
 The models are getting better fast. The people who figure out how to work with them now — really work with them, not just dabble — are going to have a massive advantage. This is that window. Let's go.
 
-Fifteen specialists and six power tools. All slash commands. All Markdown. All free. **[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License
+Eighteen specialists and seven power tools. All slash commands. All Markdown. All free. **[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License
 
 > **We're hiring.** Want to ship 10K+ LOC/day and help harden gstack?
 > Come work at YC — [ycombinator.com/software](https://ycombinator.com/software)

M TODOS.md => TODOS.md +10 -5

@@ 515,11 515,16 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
 
 ## Completed
 
-### Deploy pipeline (v0.7.0)
-- /merge skill — review-gated PR merge → superseded by /land-and-deploy
-- Deploy-verify skill → superseded by /land-and-deploy canary verification
-- Post-deploy verification (ship + browse) → superseded by /land-and-deploy
-**Completed:** v0.7.0
+### Deploy pipeline (v0.9.8.0)
+- /land-and-deploy — merge PR, wait for CI/deploy, canary verification
+- /canary — post-deploy monitoring loop with anomaly detection
+- /benchmark — performance regression detection with Core Web Vitals
+- /setup-deploy — one-time deploy platform configuration
+- /review Performance & Bundle Impact pass
+- E2E model pinning (Sonnet default, Opus for quality tests)
+- E2E timing telemetry (first_response_ms, max_inter_turn_ms, wall_clock_ms)
+- test:e2e:fast tier, --retry 2 on all E2E scripts
+**Completed:** v0.9.8.0
 
 ### Phase 1: Foundations (v0.2.0)
 - Rename to gstack

M VERSION => VERSION +1 -1

@@ 1,1 1,1 @@
-0.9.7.0
+0.9.8.0

M land-and-deploy/SKILL.md => land-and-deploy/SKILL.md +177 -4

@@ 293,11 293,14 @@ When the user types `/land-and-deploy`, run this skill.
 - `/land-and-deploy #123` — specific PR number
 - `/land-and-deploy #123 <url>` — specific PR + verification URL
 
-## Non-interactive philosophy (like /ship)
+## Non-interactive philosophy (like /ship) — with one critical gate
 
-This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step except the ones listed below. The user said `/land-and-deploy` which means DO IT.
+This is a **mostly automated** workflow. Do NOT ask for confirmation at any step except
+the ones listed below. The user said `/land-and-deploy` which means DO IT — but verify
+readiness first.
 
-**Only stop for:**
+**Always stop for:**
+- **Pre-merge readiness gate (Step 3.5)** — this is the ONE confirmation before merge
 - GitHub CLI not authenticated
 - No PR found for this branch
 - CI failures or merge conflicts


@@ 307,7 310,6 @@ This is a **non-interactive, fully automated** workflow. Do NOT ask for confirma
 
 **Never stop for:**
 - Choosing merge method (auto-detect from repo settings)
-- Confirming the merge
 - Timeout warnings (warn and continue gracefully)
 
 ---


@@ 372,6 374,177 @@ If timeout (15 min): **STOP.** "CI has been running for 15 minutes. Investigate 
 
 ---
 
+## Step 3.5: Pre-merge readiness gate
+
+**This is the critical safety check before an irreversible merge.** The merge cannot
+be undone without a revert commit. Gather ALL evidence, build a readiness report,
+and get explicit user confirmation before proceeding.
+
+Collect evidence for each check below. Track warnings (yellow) and blockers (red).
+
+### 3.5a: Review staleness check
+
+```bash
+~/.claude/skills/gstack/bin/gstack-review-read 2>/dev/null
+```
+
+Parse the output. For each review skill (plan-eng-review, plan-ceo-review,
+plan-design-review, design-review-lite, codex-review):
+
+1. Find the most recent entry within the last 7 days.
+2. Extract its `commit` field.
+3. Compare against current HEAD: `git rev-list --count STORED_COMMIT..HEAD`
+
+**Staleness rules:**
+- 0 commits since review → CURRENT
+- 1-3 commits since review → RECENT (yellow if those commits touch code, not just docs)
+- 4+ commits since review → STALE (red — review may not reflect current code)
+- No review found → NOT RUN
+
+**Critical check:** Look at what changed AFTER the last review. Run:
+```bash
+git log --oneline STORED_COMMIT..HEAD
+```
+If any commits after the review contain words like "fix", "refactor", "rewrite",
+"overhaul", or touch more than 5 files — flag as **STALE (significant changes
+since review)**. The review was done on different code than what's about to merge.
+
+### 3.5b: Test results
+
+**Free tests — run them now:**
+
+Read CLAUDE.md to find the project's test command. If not specified, use `bun test`.
+Run the test command and capture the exit code and output.
+
+```bash
+bun test 2>&1 | tail -10
+```
+
+If tests fail: **BLOCKER.** Cannot merge with failing tests.
+
+**E2E tests — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-e2e-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -20
+```
+
+For each eval file from today, parse pass/fail counts. Show:
+- Total tests, pass count, fail count
+- How long ago the run finished (from file timestamp)
+- Total cost
+- Names of any failing tests
+
+If no E2E results from today: **WARNING — no E2E tests run today.**
+If E2E results exist but have failures: **WARNING — N tests failed.** List them.
+
+**LLM judge evals — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-llm-judge-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -5
+```
+
+If found, parse and show pass/fail. If not found, note "No LLM evals run today."
+
+### 3.5c: PR body accuracy check
+
+Read the current PR body:
+```bash
+gh pr view --json body -q .body
+```
+
+Read the current diff summary:
+```bash
+git log --oneline $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -20
+```
+
+Compare the PR body against the actual commits. Check for:
+1. **Missing features** — commits that add significant functionality not mentioned in the PR
+2. **Stale descriptions** — PR body mentions things that were later changed or reverted
+3. **Wrong version** — PR title or body references a version that doesn't match VERSION file
+
+If the PR body looks stale or incomplete: **WARNING — PR body may not reflect current
+changes.** List what's missing or stale.
+
+### 3.5d: Document-release check
+
+Check if documentation was updated on this branch:
+
+```bash
+git log --oneline --all-match --grep="docs:" $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -5
+```
+
+Also check if key doc files were modified:
+```bash
+git diff --name-only $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)...HEAD -- README.md CHANGELOG.md ARCHITECTURE.md CONTRIBUTING.md CLAUDE.md VERSION
+```
+
+If CHANGELOG.md and VERSION were NOT modified on this branch and the diff includes
+new features (new files, new commands, new skills): **WARNING — /document-release
+likely not run. CHANGELOG and VERSION not updated despite new features.**
+
+If only docs changed (no code): skip this check.
+
+### 3.5e: Readiness report and confirmation
+
+Build the full readiness report:
+
+```
+╔══════════════════════════════════════════════════════════╗
+║              PRE-MERGE READINESS REPORT                  ║
+╠══════════════════════════════════════════════════════════╣
+║                                                          ║
+║  PR: #NNN — title                                        ║
+║  Branch: feature → main                                  ║
+║                                                          ║
+║  REVIEWS                                                 ║
+║  ├─ Eng Review:    CURRENT / STALE (N commits) / —       ║
+║  ├─ CEO Review:    CURRENT / — (optional)                ║
+║  ├─ Design Review: CURRENT / — (optional)                ║
+║  └─ Codex Review:  CURRENT / — (optional)                ║
+║                                                          ║
+║  TESTS                                                   ║
+║  ├─ Free tests:    PASS / FAIL (blocker)                 ║
+║  ├─ E2E tests:     52/52 pass (25 min ago) / NOT RUN     ║
+║  └─ LLM evals:     PASS / NOT RUN                        ║
+║                                                          ║
+║  DOCUMENTATION                                           ║
+║  ├─ CHANGELOG:     Updated / NOT UPDATED (warning)       ║
+║  ├─ VERSION:       0.9.8.0 / NOT BUMPED (warning)        ║
+║  └─ Doc release:   Run / NOT RUN (warning)               ║
+║                                                          ║
+║  PR BODY                                                 ║
+║  └─ Accuracy:      Current / STALE (warning)             ║
+║                                                          ║
+║  WARNINGS: N  |  BLOCKERS: N                             ║
+╚══════════════════════════════════════════════════════════╝
+```
+
+If there are BLOCKERS (failing free tests): list them and recommend B.
+If there are WARNINGS but no blockers: list each warning and recommend A if
+warnings are minor, or B if warnings are significant.
+If everything is green: recommend A.
+
+Use AskUserQuestion:
+
+- **Re-ground:** "About to merge PR #NNN (title) from branch X to Y. Here's the
+  readiness report." Show the report above.
+- List each warning and blocker explicitly.
+- **RECOMMENDATION:** Choose A if green. Choose B if there are significant warnings.
+  Choose C only if the user understands the risks.
+- A) Merge — readiness checks passed (Completeness: 10/10)
+- B) Don't merge yet — address the warnings first (Completeness: 10/10)
+- C) Merge anyway — I understand the risks (Completeness: 3/10)
+
+If the user chooses B: **STOP.** List exactly what needs to be done:
+- If reviews are stale: "Re-run /plan-eng-review (or /review) to review current code."
+- If E2E not run: "Run `bun run test:e2e` to verify."
+- If docs not updated: "Run /document-release to update documentation."
+- If PR body stale: "Update the PR body to reflect current changes."
+
+If the user chooses A or C: continue to Step 4.
+
+---
+
 ## Step 4: Merge the PR
 
 Record the start timestamp for timing data.

M land-and-deploy/SKILL.md.tmpl => land-and-deploy/SKILL.md.tmpl +177 -4

@@ 35,11 35,14 @@ When the user types `/land-and-deploy`, run this skill.
 - `/land-and-deploy #123` — specific PR number
 - `/land-and-deploy #123 <url>` — specific PR + verification URL
 
-## Non-interactive philosophy (like /ship)
+## Non-interactive philosophy (like /ship) — with one critical gate
 
-This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step except the ones listed below. The user said `/land-and-deploy` which means DO IT.
+This is a **mostly automated** workflow. Do NOT ask for confirmation at any step except
+the ones listed below. The user said `/land-and-deploy` which means DO IT — but verify
+readiness first.
 
-**Only stop for:**
+**Always stop for:**
+- **Pre-merge readiness gate (Step 3.5)** — this is the ONE confirmation before merge
 - GitHub CLI not authenticated
 - No PR found for this branch
 - CI failures or merge conflicts


@@ 49,7 52,6 @@ This is a **non-interactive, fully automated** workflow. Do NOT ask for confirma
 
 **Never stop for:**
 - Choosing merge method (auto-detect from repo settings)
-- Confirming the merge
 - Timeout warnings (warn and continue gracefully)
 
 ---


@@ 114,6 116,177 @@ If timeout (15 min): **STOP.** "CI has been running for 15 minutes. Investigate 
 
 ---
 
+## Step 3.5: Pre-merge readiness gate
+
+**This is the critical safety check before an irreversible merge.** The merge cannot
+be undone without a revert commit. Gather ALL evidence, build a readiness report,
+and get explicit user confirmation before proceeding.
+
+Collect evidence for each check below. Track warnings (yellow) and blockers (red).
+
+### 3.5a: Review staleness check
+
+```bash
+~/.claude/skills/gstack/bin/gstack-review-read 2>/dev/null
+```
+
+Parse the output. For each review skill (plan-eng-review, plan-ceo-review,
+plan-design-review, design-review-lite, codex-review):
+
+1. Find the most recent entry within the last 7 days.
+2. Extract its `commit` field.
+3. Compare against current HEAD: `git rev-list --count STORED_COMMIT..HEAD`
+
+**Staleness rules:**
+- 0 commits since review → CURRENT
+- 1-3 commits since review → RECENT (yellow if those commits touch code, not just docs)
+- 4+ commits since review → STALE (red — review may not reflect current code)
+- No review found → NOT RUN
+
+**Critical check:** Look at what changed AFTER the last review. Run:
+```bash
+git log --oneline STORED_COMMIT..HEAD
+```
+If any commits after the review contain words like "fix", "refactor", "rewrite",
+"overhaul", or touch more than 5 files — flag as **STALE (significant changes
+since review)**. The review was done on different code than what's about to merge.
+
+### 3.5b: Test results
+
+**Free tests — run them now:**
+
+Read CLAUDE.md to find the project's test command. If not specified, use `bun test`.
+Run the test command and capture the exit code and output.
+
+```bash
+bun test 2>&1 | tail -10
+```
+
+If tests fail: **BLOCKER.** Cannot merge with failing tests.
+
+**E2E tests — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-e2e-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -20
+```
+
+For each eval file from today, parse pass/fail counts. Show:
+- Total tests, pass count, fail count
+- How long ago the run finished (from file timestamp)
+- Total cost
+- Names of any failing tests
+
+If no E2E results from today: **WARNING — no E2E tests run today.**
+If E2E results exist but have failures: **WARNING — N tests failed.** List them.
+
+**LLM judge evals — check recent results:**
+
+```bash
+ls -t ~/.gstack-dev/evals/*-llm-judge-*-$(date +%Y-%m-%d)*.json 2>/dev/null | head -5
+```
+
+If found, parse and show pass/fail. If not found, note "No LLM evals run today."
+
+### 3.5c: PR body accuracy check
+
+Read the current PR body:
+```bash
+gh pr view --json body -q .body
+```
+
+Read the current diff summary:
+```bash
+git log --oneline $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -20
+```
+
+Compare the PR body against the actual commits. Check for:
+1. **Missing features** — commits that add significant functionality not mentioned in the PR
+2. **Stale descriptions** — PR body mentions things that were later changed or reverted
+3. **Wrong version** — PR title or body references a version that doesn't match VERSION file
+
+If the PR body looks stale or incomplete: **WARNING — PR body may not reflect current
+changes.** List what's missing or stale.
+
+### 3.5d: Document-release check
+
+Check if documentation was updated on this branch:
+
+```bash
+git log --oneline --all-match --grep="docs:" $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)..HEAD | head -5
+```
+
+Also check if key doc files were modified:
+```bash
+git diff --name-only $(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)...HEAD -- README.md CHANGELOG.md ARCHITECTURE.md CONTRIBUTING.md CLAUDE.md VERSION
+```
+
+If CHANGELOG.md and VERSION were NOT modified on this branch and the diff includes
+new features (new files, new commands, new skills): **WARNING — /document-release
+likely not run. CHANGELOG and VERSION not updated despite new features.**
+
+If only docs changed (no code): skip this check.
+
+### 3.5e: Readiness report and confirmation
+
+Build the full readiness report:
+
+```
+╔══════════════════════════════════════════════════════════╗
+║              PRE-MERGE READINESS REPORT                  ║
+╠══════════════════════════════════════════════════════════╣
+║                                                          ║
+║  PR: #NNN — title                                        ║
+║  Branch: feature → main                                  ║
+║                                                          ║
+║  REVIEWS                                                 ║
+║  ├─ Eng Review:    CURRENT / STALE (N commits) / —       ║
+║  ├─ CEO Review:    CURRENT / — (optional)                ║
+║  ├─ Design Review: CURRENT / — (optional)                ║
+║  └─ Codex Review:  CURRENT / — (optional)                ║
+║                                                          ║
+║  TESTS                                                   ║
+║  ├─ Free tests:    PASS / FAIL (blocker)                 ║
+║  ├─ E2E tests:     52/52 pass (25 min ago) / NOT RUN     ║
+║  └─ LLM evals:     PASS / NOT RUN                        ║
+║                                                          ║
+║  DOCUMENTATION                                           ║
+║  ├─ CHANGELOG:     Updated / NOT UPDATED (warning)       ║
+║  ├─ VERSION:       0.9.8.0 / NOT BUMPED (warning)        ║
+║  └─ Doc release:   Run / NOT RUN (warning)               ║
+║                                                          ║
+║  PR BODY                                                 ║
+║  └─ Accuracy:      Current / STALE (warning)             ║
+║                                                          ║
+║  WARNINGS: N  |  BLOCKERS: N                             ║
+╚══════════════════════════════════════════════════════════╝
+```
+
+If there are BLOCKERS (failing free tests): list them and recommend B.
+If there are WARNINGS but no blockers: list each warning and recommend A if
+warnings are minor, or B if warnings are significant.
+If everything is green: recommend A.
+
+Use AskUserQuestion:
+
+- **Re-ground:** "About to merge PR #NNN (title) from branch X to Y. Here's the
+  readiness report." Show the report above.
+- List each warning and blocker explicitly.
+- **RECOMMENDATION:** Choose A if green. Choose B if there are significant warnings.
+  Choose C only if the user understands the risks.
+- A) Merge — readiness checks passed (Completeness: 10/10)
+- B) Don't merge yet — address the warnings first (Completeness: 10/10)
+- C) Merge anyway — I understand the risks (Completeness: 3/10)
+
+If the user chooses B: **STOP.** List exactly what needs to be done:
+- If reviews are stale: "Re-run /plan-eng-review (or /review) to review current code."
+- If E2E not run: "Run `bun run test:e2e` to verify."
+- If docs not updated: "Run /document-release to update documentation."
+- If PR body stale: "Update the PR body to reflect current changes."
+
+If the user chooses A or C: continue to Step 4.
+
+---
+
 ## Step 4: Merge the PR
 
 Record the start timestamp for timing data.

M package.json => package.json +1 -1

@@ 1,6 1,6 @@
 {
   "name": "gstack",
-  "version": "0.3.3",
+  "version": "0.9.8.0",
   "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
   "license": "MIT",
   "type": "module",