---
name: ship
version: 1.0.0
description: |
  Ship workflow: detect + merge base branch, run tests, review diff, bump VERSION,
  update CHANGELOG, commit, push, create PR.
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
  - Grep
  - Glob
  - AskUserQuestion
  - WebSearch
---

{{PREAMBLE}}

{{BASE_BRANCH_DETECT}}

# Ship: Fully Automated Ship Workflow

You are running the `/ship` workflow. This is a **non-interactive, fully automated** workflow. Do NOT ask for confirmation at any step. The user said `/ship`, which means DO IT. Run straight through and output the PR URL at the end.

**Only stop for:**

- On the base branch (abort)
- Merge conflicts that can't be auto-resolved (stop, show conflicts)
- Test failures (stop, show failures)
- Pre-landing review finds ASK items that need user judgment
- MINOR or MAJOR version bump needed (ask — see Step 4)
- Greptile review comments that need user decision (complex fixes, false positives)
- TODOS.md missing and user wants to create one (ask — see Step 5.5)
- TODOS.md disorganized and user wants to reorganize (ask — see Step 5.5)

**Never stop for:**

- Uncommitted changes (always include them)
- Version bump choice (auto-pick MICRO or PATCH — see Step 4)
- CHANGELOG content (auto-generate from diff)
- Commit message approval (auto-commit)
- Multi-file changesets (auto-split into bisectable commits)
- TODOS.md completed-item detection (auto-mark)
- Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
- Test coverage gaps (auto-generate and commit, or flag in PR body)

---

## Step 1: Pre-flight

1. Check the current branch. If on the base branch or the repo's default branch, **abort**: "You're on the base branch. Ship from a feature branch."
2. Run `git status` (never use `-uall`). Uncommitted changes are always included — no need to ask.
3. Run `git diff <base>...HEAD --stat` and `git log <base>..HEAD --oneline` to understand what's being shipped.
4. Check review readiness:

   {{REVIEW_DASHBOARD}}

   If the Eng Review is NOT "CLEAR":
   1. **Check for a prior override on this branch:**

      ```bash
      eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
      grep '"skill":"ship-review-override"' ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl 2>/dev/null || echo "NO_OVERRIDE"
      ```

      If an override exists, display the dashboard and note "Review gate previously accepted — continuing." Do NOT ask again.

   2. **If no override exists,** use AskUserQuestion:
      - Show that Eng Review is missing or has open issues
      - RECOMMENDATION: Choose C if the change is obviously trivial (< 20 lines, typo fix, config-only); choose B for larger changes
      - Options: A) Ship anyway B) Abort — run /plan-eng-review first C) Change is too small to need eng review
      - If CEO Review is missing, mention it as informational ("CEO Review not run — recommended for product changes") but do NOT block
      - For Design Review: run `eval $(~/.claude/skills/gstack/bin/gstack-diff-scope 2>/dev/null)`. If `SCOPE_FRONTEND=true` and no design review (plan-design-review or design-review-lite) exists in the dashboard, mention: "Design Review not run — this PR changes frontend code. The lite design check will run automatically in Step 3.5, but consider running /plan-design-review for a full visual audit." Still never block.

   3. **If the user chooses A or C,** persist the decision so future `/ship` runs on this branch skip the gate:

      ```bash
      eval $(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null)
      echo '{"skill":"ship-review-override","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","decision":"USER_CHOICE"}' >> ~/.gstack/projects/$SLUG/$BRANCH-reviews.jsonl
      ```

      Substitute USER_CHOICE with "ship_anyway" or "not_relevant".

---

## Step 2: Merge the base branch (BEFORE tests)

Fetch and merge the base branch into the feature branch so tests run against the merged state:

```bash
git fetch origin && git merge origin/<base> --no-edit
```

**If there are merge conflicts:** Try to auto-resolve if they are simple (VERSION, schema.rb, CHANGELOG ordering).
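For the VERSION file specifically, "auto-resolve" can be as simple as keeping the higher of the two conflicting versions and re-staging the file. A minimal sketch, assuming the 4-digit `MAJOR.MINOR.PATCH.MICRO` format — the helper names are illustrative, not part of the workflow:

```shell
# Hypothetical helper: print the higher of two 4-digit versions by sorting
# each dot-separated field numerically.
pick_higher_version() {
  printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | tail -n 1
}

# Sketch of the resolution itself: read both sides from the merge index
# stages (:2 = ours, :3 = theirs), write the winner, and stage it.
# Only valid inside an in-progress merge.
resolve_version_conflict() {
  ours=$(git show :2:VERSION) && theirs=$(git show :3:VERSION) || return 1
  pick_higher_version "$ours" "$theirs" > VERSION
  git add VERSION
}
```

schema.rb and CHANGELOG-ordering conflicts need content-aware handling and are left to judgment.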
If conflicts are complex or ambiguous, **STOP** and show them.

**If already up to date:** Continue silently.

---

## Step 2.5: Test Framework Bootstrap

{{TEST_BOOTSTRAP}}

---

## Step 3: Run tests (on merged code)

**Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls `db:test:prepare` internally, which loads the schema into the correct lane database. Running bare test migrations without INSTANCE hits an orphan DB and corrupts structure.sql.

Run both test suites in parallel:

```bash
bin/test-lane 2>&1 | tee /tmp/ship_tests.txt &
npm run test 2>&1 | tee /tmp/ship_vitest.txt &
wait
```

After both complete, read the output files and check pass/fail.

**If any test fails:** Show the failures and **STOP**. Do not proceed.

**If all pass:** Continue silently — just note the counts briefly.

---

## Step 3.25: Eval Suites (conditional)

Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff.

**1. Check if the diff touches prompt-related files:**

```bash
git diff origin/<base> --name-only
```

Match against these patterns (from CLAUDE.md):

- `app/services/*_prompt_builder.rb`
- `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb`
- `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb`
- `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb`
- `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb`
- `config/system_prompts/*.txt`
- `test/evals/**/*` (eval infrastructure changes affect all suites)

**If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 3.5.

**2. Identify affected eval suites:**

Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it.
Grep these to find which suites match the changed files:

```bash
grep -l "changed_file_basename" test/evals/*_eval_runner.rb
```

Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`.

**Special cases:**

- Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which.
- Changes to `config/system_prompts/*.txt` — grep eval runners for the prompt filename to find affected suites.
- If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression.

**3. Run affected suites at `EVAL_JUDGE_TIER=full`:**

`/ship` is a pre-merge gate, so always use full tier (Sonnet structural + Opus persona judges).

```bash
EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt
```

If multiple suites need to run, run them sequentially (each needs a test lane). If the first suite fails, stop immediately — don't burn API cost on remaining suites.

**4. Check results:**

- **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
- **If all pass:** Note pass counts and cost. Continue to Step 3.5.

**5. Save eval output** — include eval results and cost dashboard in the PR body (Step 8).

**Tier reference (for context — /ship always uses `full`):**

| Tier | When | Speed (cached) | Cost |
|------|------|----------------|------|
| `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run |
| `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run |
| `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run |

---

## Step 3.4: Test Coverage Audit

100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding.
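As a cheap first-pass signal before the full trace, changed source files with no sibling test file at all are likely gaps. A rough sketch, assuming common test-file naming conventions (the helper name and the conventions are assumptions, not project facts):

```shell
# Hypothetical first-pass filter: read changed paths on stdin and echo those
# with no test file sharing their basename (billing.ts -> billing.test.*,
# billing_test.*, billing.spec.*, billing_spec.*).
untested_candidates() {
  while read -r f; do
    base=$(basename "$f" | sed 's/\.[^.]*$//')
    hits=$(find . -path ./node_modules -prune -o -type f \( \
      -name "${base}.test.*" -o -name "${base}_test.*" -o \
      -name "${base}.spec.*" -o -name "${base}_spec.*" \) -print)
    [ -z "$hits" ] && echo "$f"
  done
  return 0
}
```

Pipe `git diff origin/<base>...HEAD --name-only` into it. A match only means some test file exists — whether every branch is exercised is what the trace in this step determines.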
Evaluate what was ACTUALLY coded (from the diff), not what was planned.

**0. Before/after test count:**

```bash
# Count test files before any generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

Store this number for the PR body.

**1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:

Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:

1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
   - Where does input come from? (request params, props, database, API call)
   - What transforms it? (validation, mapping, computation)
   - Where does it go? (database write, API response, rendered output, side effect)
   - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
   - Every function/method that was added or modified
   - Every conditional branch (if/else, switch, ternary, guard clause, early return)
   - Every error path (try/catch, rescue, error boundary, fallback)
   - Every call to another function (trace into it — does IT have untested branches?)
   - Every edge: what happens with null input? Empty array? Invalid type?

This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.

**2. Map user flows, interactions, and error states:**

Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:

- **User flows:** What sequence of actions does a user take that touches this code?
  Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
- **Interaction edge cases:** What happens when the user does something unexpected?
  - Double-click/rapid resubmit
  - Navigate away mid-operation (back button, close tab, click another link)
  - Submit with stale data (page sat open for 30 minutes, session expired)
  - Slow connection (API takes 10 seconds — what does the user see?)
  - Concurrent actions (two tabs, same form)
- **Error states the user can see:** For every error the code handles, what does the user actually experience?
  - Is there a clear error message or a silent failure?
  - Can the user recover (retry, go back, fix input) or are they stuck?
  - What happens with no network? With a 500 from the API? With invalid data from the server?
- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single-character input? With maximum-length input?

Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.

**3. Check each branch against existing tests:**

Go through your diagram branch by branch — both code paths AND user flows.
For each one, search for a test that exercises it:

- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb`
- An if/else → look for tests covering BOTH the true AND false path
- An error handler → look for a test that triggers that specific error condition
- A call to `helperFn()` that has its own branches → those branches need tests too
- A user flow → look for an integration or E2E test that walks through the journey
- An interaction edge case → look for a test that simulates the unexpected action

Quality scoring rubric:

- ★★★ Tests behavior with edge cases AND error paths
- ★★ Tests correct behavior, happy path only
- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")

**4. Output ASCII coverage diagram:**

Include BOTH code paths and user flows in the same diagram:

```
CODE PATH COVERAGE
===========================
[+] src/services/billing.ts
│
├── processPayment()
│   ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
│   ├── [GAP] Network timeout — NO TEST
│   └── [GAP] Invalid currency — NO TEST
│
└── refundPayment()
    ├── [★★ TESTED] Full refund — billing.test.ts:89
    └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101

USER FLOW COVERAGE
===========================
[+] Payment checkout flow
│
├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
├── [GAP] Double-click submit — NO TEST
├── [GAP] Navigate away during payment — NO TEST
└── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40

[+] Error states
│
├── [★★ TESTED] Card declined message — billing.test.ts:58
├── [GAP] Network timeout UX (what does user see?) — NO TEST
└── [GAP] Empty cart submission — NO TEST

─────────────────────────────────
COVERAGE: 6/12 paths tested (50%)
  Code paths: 3/5 (60%)
  User flows: 3/7 (43%)
QUALITY: ★★★: 2  ★★: 2  ★: 2
GAPS: 6 paths need tests
─────────────────────────────────
```

**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue.

**5. Generate tests for uncovered paths:**

If test framework detected (or bootstrapped in Step 2.5):

- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
- Read 2-3 existing test files to match conventions exactly
- Generate unit tests. Mock all external dependencies (DB, API, Redis).
- Write tests that exercise the specific uncovered path with real assertions
- Run each test. Passes → commit as `test: coverage for {feature}`
- Fails → fix once. Still fails → revert, note gap in diagram.

Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.

If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."

**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."

**6. After-count and coverage summary:**

```bash
# Count test files after generation
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
```

For PR body: `Tests: {before} → {after} (+{delta} new)`

Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`

---

## Step 3.5: Pre-Landing Review

Review the diff for structural issues that tests don't catch.

1. Read `.claude/skills/review/checklist.md`. If the file cannot be read, **STOP** and report the error.
2. Run `git diff origin/<base>` to get the full diff (scoped to feature changes against the freshly-fetched base branch).
3. Apply the review checklist in two passes:
   - **Pass 1 (CRITICAL):** SQL & Data Safety, LLM Output Trust Boundary
   - **Pass 2 (INFORMATIONAL):** All remaining categories

   {{DESIGN_REVIEW_LITE}}

   Include any design findings alongside the code review findings. They follow the same Fix-First flow below.

4. **Classify each finding as AUTO-FIX or ASK** per the Fix-First Heuristic in checklist.md. Critical findings lean toward ASK; informational lean toward AUTO-FIX.
5. **Auto-fix all AUTO-FIX items.** Apply each fix. Output one line per fix: `[AUTO-FIXED] [file:line] Problem → what you did`
6. **If ASK items remain,** present them in ONE AskUserQuestion:
   - List each with number, severity, problem, recommended fix
   - Per-item options: A) Fix B) Skip
   - Overall RECOMMENDATION
   - If 3 or fewer ASK items, you may use individual AskUserQuestion calls instead
7. **After all fixes (auto + user-approved):**
   - If ANY fixes were applied: commit fixed files by name (`git add <files> && git commit -m "fix: pre-landing review fixes"`), then **STOP** and tell the user to run `/ship` again to re-test.
   - If no fixes applied (all ASK items skipped, or no issues found): continue to Step 4.
8. Output summary: `Pre-Landing Review: N issues — M auto-fixed, K asked (J fixed, L skipped)`
   If no issues found: `Pre-Landing Review: No issues found.`

Save the review output — it goes into the PR body in Step 8.

---

## Step 3.75: Address Greptile review comments (if PR exists)

Read `.claude/skills/review/greptile-triage.md` and follow the fetch, filter, classify, and **escalation detection** steps.

**If no PR exists, `gh` fails, the API returns an error, or there are zero Greptile comments:** Skip this step silently. Continue to Step 4.
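The fetch-and-skip behavior can be sketched as follows. This is a sketch only — the authoritative procedure lives in greptile-triage.md, and the bot-login filter and function name are assumptions:

```shell
# Hypothetical sketch: count Greptile review comments on the current PR,
# returning silently (empty output, success) when there is no PR or when
# the gh/API call errors — matching the "skip this step silently" rule.
count_greptile_comments() {
  pr=$(gh pr view --json number -q .number 2>/dev/null) || return 0  # no PR -> skip
  gh api "repos/{owner}/{repo}/pulls/$pr/comments" \
    --jq '[.[] | select(.user.login | test("greptile"; "i"))] | length' \
    2>/dev/null || return 0                                          # API error -> skip
}
```

`{owner}` and `{repo}` are filled in automatically by `gh api`; the case-insensitive login match is a guess at how the bot is named.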
**If Greptile comments are found:** Include a Greptile summary in your output: `+ N Greptile comments (X valid, Y fixed, Z FP)`

Before replying to any comment, run the **Escalation Detection** algorithm from greptile-triage.md to determine whether to use Tier 1 (friendly) or Tier 2 (firm) reply templates.

For each classified comment:

**VALID & ACTIONABLE:** Use AskUserQuestion with:

- The comment (file:line or [top-level] + body summary + permalink URL)
- `RECOMMENDATION: Choose A because [one-line reason]`
- Options: A) Fix now, B) Acknowledge and ship anyway, C) It's a false positive
- If the user chooses A: apply the fix, commit the fixed files (`git add <files> && git commit -m "fix: address Greptile review — <summary>"`), reply using the **Fix reply template** from greptile-triage.md (include inline diff + explanation), and save to both per-project and global greptile-history (type: fix).
- If the user chooses C: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), and save to both per-project and global greptile-history (type: fp).

**VALID BUT ALREADY FIXED:** Reply using the **Already Fixed reply template** from greptile-triage.md — no AskUserQuestion needed:

- Include what was done and the fixing commit SHA
- Save to both per-project and global greptile-history (type: already-fixed)

**FALSE POSITIVE:** Use AskUserQuestion:

- Show the comment and why you think it's wrong (file:line or [top-level] + body summary + permalink URL)
- Options:
  - A) Reply to Greptile explaining the false positive (recommended if clearly wrong)
  - B) Fix it anyway (if trivial)
  - C) Ignore silently
- If the user chooses A: reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), and save to both per-project and global greptile-history (type: fp)

**SUPPRESSED:** Skip silently — these are known false positives from previous triage.
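Each "save to greptile-history" above appends one record per triage decision. A minimal JSONL sketch — the exact paths and field names come from greptile-triage.md, so everything here is illustrative:

```shell
# Hypothetical helper: append one triage record (type: fix | fp | already-fixed)
# to a JSONL history file. Real file locations and schema are defined in
# greptile-triage.md, not here.
save_triage_record() {
  type="$1" comment_id="$2" history_file="$3"
  printf '{"type":"%s","comment_id":"%s","timestamp":"%s"}\n' \
    "$type" "$comment_id" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$history_file"
}
```

It would be called twice per decision: once with the per-project history path and once with the global one.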
**After all comments are resolved:** If any fixes were applied, the tests from Step 3 are now stale. **Re-run tests** (Step 3) before continuing to Step 4. If no fixes were applied, continue to Step 4.

---

## Step 4: Version bump (auto-decide)

1. Read the current `VERSION` file (4-digit format: `MAJOR.MINOR.PATCH.MICRO`)
2. **Auto-decide the bump level based on the diff:**
   - Count lines changed (`git diff origin/<base>...HEAD --stat | tail -1`)
   - **MICRO** (4th digit): < 50 lines changed, trivial tweaks, typos, config
   - **PATCH** (3rd digit): 50+ lines changed, bug fixes, small-medium features
   - **MINOR** (2nd digit): **ASK the user** — only for major features or significant architectural changes
   - **MAJOR** (1st digit): **ASK the user** — only for milestones or breaking changes
3. Compute the new version:
   - Bumping a digit resets all digits to its right to 0
   - Example: `0.19.1.0` + PATCH → `0.19.2.0`
4. Write the new version to the `VERSION` file.

---

## Step 5: CHANGELOG (auto-generate)

1. Read the `CHANGELOG.md` header to know the format.
2. Auto-generate the entry from **ALL commits on the branch** (not just recent ones):
   - Use `git log <base>..HEAD --oneline` to see every commit being shipped
   - Use `git diff <base>...HEAD` to see the full diff against the base branch
   - The CHANGELOG entry must be comprehensive of ALL changes going into the PR
   - If existing CHANGELOG entries on the branch already cover some commits, replace them with one unified entry for the new version
   - Categorize changes into applicable sections:
     - `### Added` — new features
     - `### Changed` — changes to existing functionality
     - `### Fixed` — bug fixes
     - `### Removed` — removed features
   - Write concise, descriptive bullet points
   - Insert after the file header (line 5), dated today
   - Format: `## [X.Y.Z.W] - YYYY-MM-DD`

**Do NOT ask the user to describe changes.** Infer from the diff and commit history.

---

## Step 5.5: TODOS.md (auto-update)

Cross-reference the project's TODOS.md against the changes being shipped.
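One mechanical form of this cross-reference is checking whether files a TODO mentions appear in the shipped diff. A hypothetical sketch — the helper name and input shape are assumptions, and matches are evidence of completion, not proof (commit-message and functional matching remain judgment calls):

```shell
# Hypothetical helper: read file paths referenced by a TODO item on stdin and
# echo those that also appear in the diff (first argument: newline-separated
# changed paths, e.g. from `git diff <base>...HEAD --name-only`).
todo_files_in_diff() {
  changed="$1"
  while read -r ref; do
    printf '%s\n' "$changed" | grep -qxF "$ref" && echo "$ref"
  done
  return 0
}
```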
Mark completed items automatically; prompt only if the file is missing or disorganized. Read `.claude/skills/review/TODOS-format.md` for the canonical format reference.

**1. Check if TODOS.md exists** in the repository root.

**If TODOS.md does not exist:** Use AskUserQuestion:

- Message: "GStack recommends maintaining a TODOS.md organized by skill/component, then priority (P0 at top through P4, then Completed at bottom). See TODOS-format.md for the full format. Would you like to create one?"
- Options: A) Create it now, B) Skip for now
- If A: Create `TODOS.md` with a skeleton (# TODOS heading + ## Completed section). Continue to step 3.
- If B: Skip the rest of Step 5.5. Continue to Step 6.

**2. Check structure and organization:**

Read TODOS.md and verify it follows the recommended structure:

- Items grouped under `## <component>` headings
- Each item has a `**Priority:**` field with a P0-P4 value
- A `## Completed` section at the bottom

**If disorganized** (missing priority fields, no component groupings, no Completed section): Use AskUserQuestion:

- Message: "TODOS.md doesn't follow the recommended structure (skill/component groupings, P0-P4 priority, Completed section). Would you like to reorganize it?"
- Options: A) Reorganize now (recommended), B) Leave as-is
- If A: Reorganize in-place following TODOS-format.md. Preserve all content — only restructure, never delete items.
- If B: Continue to step 3 without restructuring.

**3. Detect completed TODOs:**

This step is fully automatic — no user interaction.
Use the diff and commit history already gathered in earlier steps:

- `git diff <base>...HEAD` (full diff against the base branch)
- `git log <base>..HEAD --oneline` (all commits being shipped)

For each TODO item, check if the changes in this PR complete it by:

- Matching commit messages against the TODO title and description
- Checking if files referenced in the TODO appear in the diff
- Checking if the TODO's described work matches the functional changes

**Be conservative:** Only mark a TODO as completed if there is clear evidence in the diff. If uncertain, leave it alone.

**4. Move completed items** to the `## Completed` section at the bottom. Append: `**Completed:** vX.Y.Z.W (YYYY-MM-DD)`

**5. Output summary:**

- `TODOS.md: N items marked complete (item1, item2, ...). M items remaining.`
- Or: `TODOS.md: No completed items detected. M items remaining.`
- Or: `TODOS.md: Created.` / `TODOS.md: Reorganized.`

**6. Defensive:** If TODOS.md cannot be written (permission error, disk full), warn the user and continue. Never stop the ship workflow for a TODOS failure.

Save this summary — it goes into the PR body in Step 8.

---

## Step 6: Commit (bisectable chunks)

**Goal:** Create small, logical commits that work well with `git bisect` and help LLMs understand what changed.

1. Analyze the diff and group changes into logical commits. Each commit should represent **one coherent change** — not one file, but one logical unit.
2. **Commit ordering** (earlier commits first):
   - **Infrastructure:** migrations, config changes, route additions
   - **Models & services:** new models, services, concerns (with their tests)
   - **Controllers & views:** controllers, views, JS/React components (with their tests)
   - **VERSION + CHANGELOG + TODOS.md:** always in the final commit
3. **Rules for splitting:**
   - A model and its test file go in the same commit
   - A service and its test file go in the same commit
   - A controller, its views, and its test go in the same commit
   - Migrations are their own commit (or grouped with the model they support)
   - Config/route changes can group with the feature they enable
   - If the total diff is small (< 50 lines across < 4 files), a single commit is fine
4. **Each commit must be independently valid** — no broken imports, no references to code that doesn't exist yet. Order commits so dependencies come first.
5. Compose each commit message:
   - First line: `<type>: <summary>` (type = feat/fix/chore/refactor/docs)
   - Body: brief description of what this commit contains
   - Only the **final commit** (VERSION + CHANGELOG) gets the version tag and co-author trailer:

   ```bash
   git commit -m "$(cat <<'EOF'
   chore: bump version and changelog (vX.Y.Z.W)

   Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
   EOF
   )"
   ```

---

## Step 7: Push

Push to the remote with upstream tracking:

```bash
git push -u origin <branch>
```

---

## Step 8: Create PR

Create a pull request using `gh`:

```bash
gh pr create --base <base> --title "<type>: <summary>" --body "$(cat <<'EOF'
## Summary

## Test Coverage

## Pre-Landing Review

## Design Review

## Eval Results

## Greptile Review

## TODOS

## Test plan

- [x] All Rails tests pass (N runs, 0 failures)
- [x] All Vitest tests pass (N tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```

**Output the PR URL** — this should be the final output the user sees.

---

## Important Rules

- **Never skip tests.** If tests fail, stop.
- **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
- **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
- **Always use the 4-digit version format** from the VERSION file.
- **Date format in CHANGELOG:** `YYYY-MM-DD`
- **Split commits for bisectability** — each commit = one logical change.
- **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done.
- **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies.
- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests.
- **The goal is: user says `/ship`, and the next thing they see is the review + PR URL.**
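For reference, the Step 4 compute rule ("bumping a digit resets all digits to its right to 0") can be sketched as a small helper — a minimal sketch assuming the 4-digit format, with an illustrative function name:

```shell
# Hypothetical sketch of the Step 4 bump rule: bump one level of a 4-digit
# MAJOR.MINOR.PATCH.MICRO version, zeroing every digit to the right.
bump_version() {
  v=$1 level=$2
  old_ifs=$IFS; IFS=.
  set -- $v               # split "0.19.1.0" into 4 positional params
  IFS=$old_ifs
  maj=$1 min=$2 pat=$3 mic=$4
  case "$level" in
    MAJOR) maj=$((maj + 1)); min=0; pat=0; mic=0 ;;
    MINOR) min=$((min + 1)); pat=0; mic=0 ;;
    PATCH) pat=$((pat + 1)); mic=0 ;;
    MICRO) mic=$((mic + 1)) ;;
    *) return 1 ;;
  esac
  echo "$maj.$min.$pat.$mic"
}
```

This reproduces the worked example in Step 4: `bump_version 0.19.1.0 PATCH` yields `0.19.2.0`.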