~cytrogen/gstack

a4a181ca92028359403d9f92f6f527db48244ee6 — Garry Tan 9 days ago a0328be
feat: Review Army — parallel specialist reviewers for /review (v0.14.4.0) (#692)

* feat: extend gstack-diff-scope with SCOPE_MIGRATIONS, SCOPE_API, SCOPE_AUTH

Three new scope signals for Review Army specialist activation:
- SCOPE_MIGRATIONS: db/migrate/, prisma/migrations/, alembic/, *.sql
- SCOPE_API: *controller*, *route*, *endpoint*, *.graphql, openapi.*
- SCOPE_AUTH: *auth*, *session*, *jwt*, *oauth*, *permission*, *role*

* feat: add 7 specialist checklist files for Review Army

- testing.md (always-on): coverage gaps, flaky patterns, security enforcement
- maintainability.md (always-on): dead code, DRY, stale comments
- security.md (conditional): OWASP deep analysis, auth bypass, injection
- performance.md (conditional): N+1 queries, bundle impact, complexity
- data-migration.md (conditional): reversibility, lock duration, backfill
- api-contract.md (conditional): breaking changes, versioning, error format
- red-team.md (conditional): adversarial analysis, cross-cutting concerns

All use a standard header with a JSON output schema and a NO FINDINGS fallback.

* feat: Review Army resolver — parallel specialist dispatch + merge

New resolver in review-army.ts generates template prose for:
- Stack detection and specialist selection
- Parallel Agent tool dispatch with learning-informed prompts
- JSON finding collection, fingerprint dedup, consensus highlighting
- PR quality score computation
- Red Team conditional dispatch

Registered as REVIEW_ARMY in resolvers/index.ts.

* refactor: restructure /review template for Review Army

- Replace Steps 4-4.75 with CRITICAL pass + {{REVIEW_ARMY}}
- Remove {{DESIGN_REVIEW_LITE}} and {{TEST_COVERAGE_AUDIT_REVIEW}}
  (subsumed into Design and Testing specialists respectively)
- Extract specialist-covered categories from checklist.md
- Keep CRITICAL + uncovered INFORMATIONAL in main agent pass

* test: Review Army — 14 diff-scope tests + 7 E2E tests

- test/diff-scope.test.ts: 14 tests for all 9 scope signals
- test/skill-e2e-review-army.test.ts: 7 E2E tests
  Gate: migration safety, N+1 detection, delivery audit,
        quality score, JSON findings
  Periodic: red team, consensus
- Updated gen-skill-docs tests for new review structure
- Added touchfile entries and tier classifications

* docs: update SELF_LEARNING_V0.md with Release 2 status + Release 2.5

Mark Release 2 (Review Army) as in-progress. Add Release 2.5 for
deferred expansions (E1 adaptive gating, E3 test stubs, E5 cross-review
dedup, E7 specialist tracking).

* chore: bump version and changelog (v0.14.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
M CHANGELOG.md => CHANGELOG.md +20 -0
@@ 1,5 1,25 @@
# Changelog

## [0.14.4.0] - 2026-03-31 — Review Army: Parallel Specialist Reviewers

Every `/review` now dispatches specialist subagents in parallel. Instead of one agent applying one giant checklist, you get focused reviewers for testing gaps, maintainability, security, performance, data migrations, API contracts, and adversarial red-teaming. Each specialist reads the diff independently with fresh context, outputs structured JSON findings, and the main agent merges, deduplicates, and boosts confidence when multiple specialists flag the same issue. Small diffs (<50 lines) skip specialists entirely for speed. Large diffs (200+ lines) activate the Red Team for adversarial analysis on top.

### Added

- **7 specialist reviewers** running in parallel via Agent tool subagents. Always-on: Testing + Maintainability. Conditional: Security (auth scope), Performance (backend/frontend), Data Migration (migration files), API Contract (controllers/routes), Red Team (large diffs or critical findings).
- **JSON finding schema.** Specialists output structured JSON objects with severity, confidence, path, line, category, fix, and fingerprint fields. Reliable parsing, no more pipe-delimited text.
- **Fingerprint-based dedup.** When two specialists flag the same file:line:category, the finding gets boosted confidence and a "MULTI-SPECIALIST CONFIRMED" marker.
- **PR Quality Score.** Every review computes a 0-10 quality score: `max(0, 10 - (critical * 2 + informational * 0.5))`. Logged to review history for trending via `/retro`.
- **3 new diff-scope signals.** `gstack-diff-scope` now detects SCOPE_MIGRATIONS, SCOPE_API, and SCOPE_AUTH to activate the right specialists.
- **Learning-informed specialist prompts.** Each specialist gets past learnings for its domain injected into the prompt, so reviews get smarter over time.
- **14 new diff-scope tests** covering all 9 scope signals including the 3 new ones.
- **7 new E2E tests** (5 gate, 2 periodic) covering migration safety, N+1 detection, delivery audit, quality score, JSON schema compliance, red team activation, and multi-specialist consensus.
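As a rough sketch, the quality-score formula above reduces to a one-line clamp (the `qualityScore` function name is illustrative, not gstack's actual API):

```typescript
// PR quality score per the changelog formula: floor at 0, ceiling at 10.
function qualityScore(criticalCount: number, informationalCount: number): number {
  const raw = 10 - (criticalCount * 2 + informationalCount * 0.5);
  return Math.min(10, Math.max(0, raw));
}
```

A clean review scores 10; two critical plus three informational findings score 4.5; six criticals bottom out at 0.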

### Changed

- **Review checklist refactored.** Categories now covered by specialists (test gaps, dead code, magic numbers, performance, crypto) have been removed from the main checklist. The main agent focuses on the CRITICAL pass only.
- **Delivery Integrity enhanced.** The existing plan completion audit now investigates WHY items are missing (not just that they're missing) and logs plan-file discrepancies as learnings. Commit-message inference is informational only, never persisted.

## [0.14.3.0] - 2026-03-31 — Always-On Adversarial Review + Scope Drift + Plan Mode Design Tools

Every code review now runs adversarial analysis from both Claude and Codex, regardless of diff size. A 5-line auth change gets the same cross-model scrutiny as a 500-line feature. The old "skip adversarial for small diffs" heuristic is gone: diff size was never a good proxy for risk.

M VERSION => VERSION +1 -1
@@ 1,1 1,1 @@
0.14.3.0
0.14.4.0

M bin/gstack-diff-scope => bin/gstack-diff-scope +19 -0
@@ 16,6 16,9 @@ if [ -z "$FILES" ]; then
  echo "SCOPE_TESTS=false"
  echo "SCOPE_DOCS=false"
  echo "SCOPE_CONFIG=false"
  echo "SCOPE_MIGRATIONS=false"
  echo "SCOPE_API=false"
  echo "SCOPE_AUTH=false"
  exit 0
fi



@@ 25,6 28,9 @@ PROMPTS=false
TESTS=false
DOCS=false
CONFIG=false
MIGRATIONS=false
API=false
AUTH=false

while IFS= read -r f; do
  case "$f" in


@@ 57,6 63,16 @@ while IFS= read -r f; do
    .github/*) CONFIG=true ;;
    requirements.txt|pyproject.toml|go.mod|Cargo.toml|composer.json) CONFIG=true ;;

    # Migrations: database migration files
    db/migrate/*|*/migrations/*|alembic/*|prisma/migrations/*) MIGRATIONS=true ;;

    # API: routes, controllers, endpoints, GraphQL/OpenAPI schemas
    *controller*|*route*|*endpoint*|*/api/*) API=true ;;
    *.graphql|*.gql|openapi.*|swagger.*) API=true ;;

    # Auth: authentication, authorization, sessions, permissions
    *auth*|*session*|*jwt*|*oauth*|*permission*|*role*) AUTH=true ;;

    # Backend: everything else that's code (excluding views/components already matched)
    *.rb|*.py|*.go|*.rs|*.java|*.php|*.ex|*.exs) BACKEND=true ;;
    *.ts|*.js) BACKEND=true ;;  # Non-component TS/JS is backend


@@ 69,3 85,6 @@ echo "SCOPE_PROMPTS=$PROMPTS"
echo "SCOPE_TESTS=$TESTS"
echo "SCOPE_DOCS=$DOCS"
echo "SCOPE_CONFIG=$CONFIG"
echo "SCOPE_MIGRATIONS=$MIGRATIONS"
echo "SCOPE_API=$API"
echo "SCOPE_AUTH=$AUTH"

M docs/designs/SELF_LEARNING_V0.md => docs/designs/SELF_LEARNING_V0.md +29 -5
@@ 91,11 91,35 @@ gstack-review-log pattern.
**Headline:** 10 specialist reviewers on every PR.

What ships:
- Parallel review agents: always-on (correctness, testing, maintainability) +
  conditional (security, performance, API, data-migrations, reliability) +
  stack-specific (Rails, TypeScript, Python, frontend-races)
- Red team reviewer activated for large diffs and high-risk domains
- Structured findings with confidence scores + merge/dedup across agents
- 7 parallel specialist subagents: always-on (testing, maintainability) +
  conditional (security, performance, data-migration, API contract, design) +
  red team (large diffs / critical findings)
- JSON-structured findings with confidence scores + fingerprint dedup across agents
- PR quality score (0-10) logged per review + /retro trending (E2)
- Learning-informed specialist prompts — past pitfalls injected per domain (E4)
- Multi-specialist consensus highlighting — confirmed findings get boosted (E6)
- Enhanced Delivery Integrity via PLAN_COMPLETION_AUDIT — investigation depth,
  commit message fallback, plan-file learnings logging
- Checklist refactored: CRITICAL categories stay in main pass, specialist
  categories extracted to focused checklists in review/specialists/

### Release 2.5: "Review Army Expansions" (v0.15.x)

**Headline:** Ship after R2 proves stable. Check in on how the core loop is performing.

Pre-check: review R2 quality metrics (PR quality scores, specialist hit rates,
false positive rates, E2E test stability). If core loop has issues, fix those first.

What ships:
- E1: Adaptive specialist gating — auto-skip specialists with 0-finding track record.
  Store per-project hit rates via gstack-learnings-log. User can force with --security etc.
- E3: Test stub generation — each specialist outputs TEST_STUB alongside findings.
  Framework detected from project (Jest/Vitest/RSpec/pytest/Go test).
  Flows into Fix-First: AUTO-FIX applies fix + creates test file.
- E5: Cross-review finding dedup — read gstack-review-read for prior review entries.
  Suppress findings matching a prior user-skipped finding.
- E7: Specialist performance tracking — log per-specialist metrics via gstack-review-log.
  /retro integration: "Top finding specialist: Performance (7 findings)."

### Release 3: "Smart Ceremony" (v0.16)


M package.json => package.json +1 -1
@@ 1,6 1,6 @@
{
  "name": "gstack",
  "version": "0.14.2.0",
  "version": "0.14.4.0",
  "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.",
  "license": "MIT",
  "type": "module",

M review/SKILL.md => review/SKILL.md +172 -208
@@ 608,14 608,69 @@ COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
─────────────────────────────────
```

### Fallback Intent Sources (when no plan file found)

When no plan file is detected, use these secondary intent sources:

1. **Commit messages:** Run `git log origin/<base>..HEAD --oneline`. Use judgment to extract real intent:
   - Commits with actionable verbs ("add", "implement", "fix", "create", "remove", "update") are intent signals
   - Skip noise: "WIP", "tmp", "squash", "merge", "chore", "typo", "fixup"
   - Extract the intent behind the commit, not the literal message
2. **TODOS.md:** If it exists, check for items related to this branch or recent dates
3. **PR description:** Run `gh pr view --json body -q .body 2>/dev/null` for intent context

**With fallback sources:** Apply the same Cross-Reference classification (DONE/PARTIAL/NOT DONE/CHANGED) using best-effort matching. Note that fallback-sourced items are lower confidence than plan-file items.

### Investigation Depth

For each PARTIAL or NOT DONE item, investigate WHY:

1. Check `git log origin/<base>..HEAD --oneline` for commits that suggest the work was started, attempted, or reverted
2. Read the relevant code to understand what was built instead
3. Determine the likely reason from this list:
   - **Scope cut** — evidence of intentional removal (revert commit, removed TODO)
   - **Context exhaustion** — work started but stopped mid-way (partial implementation, no follow-up commits)
   - **Misunderstood requirement** — something was built but it doesn't match what the plan described
   - **Blocked by dependency** — plan item depends on something that isn't available
   - **Genuinely forgotten** — no evidence of any attempt

Output for each discrepancy:
```
DISCREPANCY: {PARTIAL|NOT_DONE} | {plan item} | {what was actually delivered}
INVESTIGATION: {likely reason with evidence from git log / code}
IMPACT: {HIGH|MEDIUM|LOW} — {what breaks or degrades if this stays undelivered}
```

### Learnings Logging (plan-file discrepancies only)

**Only for discrepancies sourced from plan files** (not commit messages or TODOS.md), log a learning so future sessions know this pattern occurred:

```bash
~/.claude/skills/gstack/bin/gstack-learnings-log '{
  "type": "pitfall",
  "key": "plan-delivery-gap-KEBAB_SUMMARY",
  "insight": "Planned X but delivered Y because Z",
  "confidence": 8,
  "source": "observed",
  "files": ["PLAN_FILE_PATH"]
}'
```

Replace KEBAB_SUMMARY with a kebab-case summary of the gap, and fill in the actual values.

**Do NOT log learnings from commit-message-derived or TODOS.md-derived discrepancies.** These are informational in the review output but too noisy for durable memory.

### Integration with Scope Drift Detection

The plan completion results augment the existing Scope Drift Detection. If a plan file is found:

- **NOT DONE items** become additional evidence for **MISSING REQUIREMENTS** in the scope drift report.
- **Items in the diff that don't match any plan item** become evidence for **SCOPE CREEP** detection.
- **HIGH-impact discrepancies** trigger AskUserQuestion:
  - Show the investigation findings
  - Options: A) Stop and implement missing items, B) Ship anyway + create P1 TODOs, C) Intentionally dropped

This is **INFORMATIONAL** — does not block the review (consistent with existing scope drift behavior).
This is **INFORMATIONAL** unless HIGH-impact discrepancies are found (then it gates via AskUserQuestion).

Update the scope drift output to include plan file context:



@@ 625,11 680,11 @@ Intent: <from plan file — 1-line summary>
Plan: <plan file path>
Delivered: <1-line summary of what the diff actually does>
Plan items: N DONE, M PARTIAL, K NOT DONE
[If NOT DONE: list each missing item]
[If NOT DONE: list each missing item with investigation]
[If scope creep: list each out-of-scope change not in the plan]
```

**No plan file found:** Fall back to existing scope drift behavior (check TODOS.md and PR description only).
**No plan file found:** Use commit messages and TODOS.md as fallback sources (see above). If no intent sources at all, skip with: "No intent sources detected — skipping completion audit."

## Step 2: Read the checklist



@@ 699,12 754,12 @@ matches a past learning, display:
This makes the compounding visible. The user should see that gstack is getting
smarter on their codebase over time.

## Step 4: Two-pass review
## Step 4: Critical pass (core review)

Apply the checklist against the diff in two passes:
Apply the CRITICAL categories from the checklist against the diff:
SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Shell Injection, Enum & Value Completeness.

1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend, Performance & Bundle Impact
Also apply the remaining INFORMATIONAL categories that are still in the checklist (Async/Sync Mixing, Column/Field Name Safety, LLM Prompt Issues, Type Coercion, View/Frontend, Time Window Safety, Completeness Gaps, Distribution & CI/CD).

**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.



@@ 744,258 799,167 @@ higher confidence.

---

## Step 4.5: Design Review (conditional)
## Step 4.5: Review Army — Specialist Dispatch

## Design Review (conditional, diff-scoped)

Check if the diff touches frontend files using `gstack-diff-scope`:
### Detect stack and scope

```bash
source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null) || true
# Detect stack for specialist context
STACK=""
[ -f Gemfile ] && STACK="${STACK}ruby "
[ -f package.json ] && STACK="${STACK}node "
[ -f requirements.txt ] || [ -f pyproject.toml ] && STACK="${STACK}python "
[ -f go.mod ] && STACK="${STACK}go "
[ -f Cargo.toml ] && STACK="${STACK}rust "
echo "STACK: ${STACK:-unknown}"
DIFF_LINES=$(git diff origin/<base> --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
echo "DIFF_LINES: $DIFF_LINES"
```

**If `SCOPE_FRONTEND=false`:** Skip design review silently. No output.

**If `SCOPE_FRONTEND=true`:**

1. **Check for DESIGN.md.** If `DESIGN.md` or `design-system.md` exists in the repo root, read it. All design findings are calibrated against it — patterns blessed in DESIGN.md are not flagged. If not found, use universal design principles.

2. **Read `.claude/skills/review/design-checklist.md`.** If the file cannot be read, skip design review with a note: "Design checklist not found — skipping design review."

3. **Read each changed frontend file** (full file, not just diff hunks). Frontend files are identified by the patterns listed in the checklist.
### Select specialists

4. **Apply the design checklist** against the changed files. For each item:
   - **[HIGH] mechanical CSS fix** (`outline: none`, `!important`, `font-size < 16px`): classify as AUTO-FIX
   - **[HIGH/MEDIUM] design judgment needed**: classify as ASK
   - **[LOW] intent-based detection**: present as "Possible — verify visually or run /design-review"
Based on the scope signals above, select which specialists to dispatch.

5. **Include findings** in the review output under a "Design Review" header, following the output format in the checklist. Design findings merge with code review findings into the same Fix-First flow.
**Always-on (dispatch on every review with 50+ changed lines):**
1. **Testing** — read `~/.claude/skills/gstack/review/specialists/testing.md`
2. **Maintainability** — read `~/.claude/skills/gstack/review/specialists/maintainability.md`

6. **Log the result** for the Review Readiness Dashboard:
**If DIFF_LINES < 50:** Skip all specialists. Print: "Small diff ($DIFF_LINES lines) — specialists skipped." Continue to Step 5.

```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-review-lite","timestamp":"TIMESTAMP","status":"STATUS","findings":N,"auto_fixed":M,"commit":"COMMIT"}'
```

Substitute: TIMESTAMP = ISO 8601 datetime, STATUS = "clean" if 0 findings or "issues_found", N = total findings, M = auto-fixed count, COMMIT = output of `git rev-parse --short HEAD`.

7. **Codex design voice** (optional, automatic if available):
**Conditional (dispatch if the matching scope signal is true):**
3. **Security** — if SCOPE_AUTH=true, OR if SCOPE_BACKEND=true AND DIFF_LINES > 100. Read `~/.claude/skills/gstack/review/specialists/security.md`
4. **Performance** — if SCOPE_BACKEND=true OR SCOPE_FRONTEND=true. Read `~/.claude/skills/gstack/review/specialists/performance.md`
5. **Data Migration** — if SCOPE_MIGRATIONS=true. Read `~/.claude/skills/gstack/review/specialists/data-migration.md`
6. **API Contract** — if SCOPE_API=true. Read `~/.claude/skills/gstack/review/specialists/api-contract.md`
7. **Design** — if SCOPE_FRONTEND=true. Use the existing design review checklist at `~/.claude/skills/gstack/review/design-checklist.md`

```bash
which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
```

If Codex is available, run a lightweight design check on the diff:

```bash
TMPERR_DRL=$(mktemp /tmp/codex-drl-XXXXXXXX)
_REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; }
codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): 1. Brand/product unmistakable in first screen? 2. One strong visual anchor present? 3. Page understandable by scanning headlines only? 4. Each section has one job? 5. Are cards actually necessary? 6. Does motion improve hierarchy or atmosphere? 7. Would design feel premium with all decorative shadows removed? Flag any hard rejections: 1. Generic SaaS card grid as first impression 2. Beautiful image with weak brand 3. Strong headline with no clear action 4. Busy imagery behind text 5. Sections repeating same mood statement 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout 5 most important design findings only. Reference file:line." -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DRL"
```

Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
```bash
cat "$TMPERR_DRL" && rm -f "$TMPERR_DRL"
```

**Error handling:** All errors are non-blocking. On auth failure, timeout, or empty response — skip with a brief note and continue.

Present Codex output under a `CODEX (design):` header, merged with the checklist findings above.

Include any design findings alongside the findings from Step 4. They follow the same Fix-First flow in Step 5 — AUTO-FIX for mechanical CSS fixes, ASK for everything else.
Note which specialists were selected and which were skipped. Print the selection:
"Dispatching N specialists: [names]. Skipped: [names] (scope not detected)."

---

## Step 4.75: Test Coverage Diagram
### Dispatch specialists in parallel

100% coverage is the goal. Evaluate every codepath changed in the diff and identify test gaps. Gaps become INFORMATIONAL findings that follow the Fix-First flow.
For each selected specialist, launch an independent subagent via the Agent tool.
**Launch ALL selected specialists in a single message** (multiple Agent tool calls)
so they run in parallel. Each subagent has fresh context — no prior review bias.

### Test Framework Detection
**Each specialist subagent prompt:**

Before analyzing coverage, detect the project's test framework:
Construct the prompt for each specialist. The prompt includes:

1. **Read CLAUDE.md** — look for a `## Testing` section with test command and framework name. If found, use that as the authoritative source.
2. **If CLAUDE.md has no testing section, auto-detect:**
1. The specialist's checklist content (you already read the file above)
2. Stack context: "This is a {STACK} project."
3. Past learnings for this domain (if any exist):

```bash
setopt +o nomatch 2>/dev/null || true  # zsh compat
# Detect project runtime
[ -f Gemfile ] && echo "RUNTIME:ruby"
[ -f package.json ] && echo "RUNTIME:node"
[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
[ -f go.mod ] && echo "RUNTIME:go"
[ -f Cargo.toml ] && echo "RUNTIME:rust"
# Check for existing test infrastructure
ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
~/.claude/skills/gstack/bin/gstack-learnings-search --type pitfall --query "{specialist domain}" --limit 5 2>/dev/null || true
```

3. **If no framework detected:** still produce the coverage diagram, but skip test generation.

**Step 1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:

Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:
If learnings are found, include them: "Past learnings for this domain: {learnings}"

1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
   - Where does input come from? (request params, props, database, API call)
   - What transforms it? (validation, mapping, computation)
   - Where does it go? (database write, API response, rendered output, side effect)
   - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
   - Every function/method that was added or modified
   - Every conditional branch (if/else, switch, ternary, guard clause, early return)
   - Every error path (try/catch, rescue, error boundary, fallback)
   - Every call to another function (trace into it — does IT have untested branches?)
   - Every edge: what happens with null input? Empty array? Invalid type?
4. Instructions:

This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.
"You are a specialist code reviewer. Read the checklist below, then run
`git diff origin/<base>` to get the full diff. Apply the checklist against the diff.

**Step 2. Map user flows, interactions, and error states:**
For each finding, output a JSON object on its own line:
{\"severity\":\"CRITICAL|INFORMATIONAL\",\"confidence\":N,\"path\":\"file\",\"line\":N,\"category\":\"category\",\"summary\":\"description\",\"fix\":\"recommended fix\",\"fingerprint\":\"path:line:category\",\"specialist\":\"name\"}

Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:
Required fields: severity, confidence, path, category, summary, specialist.
Optional: line, fix, fingerprint, evidence.

- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
- **Interaction edge cases:** What happens when the user does something unexpected?
  - Double-click/rapid resubmit
  - Navigate away mid-operation (back button, close tab, click another link)
  - Submit with stale data (page sat open for 30 minutes, session expired)
  - Slow connection (API takes 10 seconds — what does the user see?)
  - Concurrent actions (two tabs, same form)
- **Error states the user can see:** For every error the code handles, what does the user actually experience?
  - Is there a clear error message or a silent failure?
  - Can the user recover (retry, go back, fix input) or are they stuck?
  - What happens with no network? With a 500 from the API? With invalid data from the server?
- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input?
If no findings: output `NO FINDINGS` and nothing else.
Do not output anything else — no preamble, no summary, no commentary.

Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.
Stack context: {STACK}
Past learnings: {learnings or 'none'}

**Step 3. Check each branch against existing tests:**
CHECKLIST:
{checklist content}"

Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it:
- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb`
- An if/else → look for tests covering BOTH the true AND false path
- An error handler → look for a test that triggers that specific error condition
- A call to `helperFn()` that has its own branches → those branches need tests too
- A user flow → look for an integration or E2E test that walks through the journey
- An interaction edge case → look for a test that simulates the unexpected action
**Subagent configuration:**
- Use `subagent_type: "general-purpose"`
- Do NOT use `run_in_background` — all specialists must complete before merge
- If any specialist subagent fails or times out, log the failure and continue with results from successful specialists. Specialists are additive — partial results are better than no results.

Quality scoring rubric:
- ★★★  Tests behavior with edge cases AND error paths
- ★★   Tests correct behavior, happy path only
- ★    Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")

### E2E Test Decision Matrix

When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:

**RECOMMEND E2E (mark as [→E2E] in the diagram):**
- Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
- Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
- Auth/payment/data-destruction flows — too important to trust unit tests alone

**RECOMMEND EVAL (mark as [→EVAL] in the diagram):**
- Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
- Changes to prompt templates, system instructions, or tool definitions

**STICK WITH UNIT TESTS:**
- Pure function with clear inputs/outputs
- Internal helper with no side effects
- Edge case of a single function (null input, empty array)
- Obscure/rare flow that isn't customer-facing
---

### REGRESSION RULE (mandatory)
### Step 4.6: Collect and merge findings

**IRON RULE:** When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — a regression test is written immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.
After all specialist subagents complete, collect their outputs.

A regression is when:
- The diff modifies existing behavior (not new code)
- The existing test suite (if any) doesn't cover the changed path
- The change introduces a new failure mode for existing callers
**Parse findings:**
For each specialist's output:
1. If output is "NO FINDINGS" — skip, this specialist found nothing
2. Otherwise, parse each line as a JSON object. Skip lines that are not valid JSON.
3. Collect all parsed findings into a single list, tagged with their specialist name.
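A minimal sketch of this tolerant line-by-line parse (the `Finding` shape mirrors the schema fields; `parseFindings` is an illustrative name):

```typescript
interface Finding {
  severity: string;
  confidence: number;
  path: string;
  category: string;
  summary: string;
  specialist: string;
  line?: number;
  fix?: string;
  fingerprint?: string;
}

function parseFindings(output: string, specialist: string): Finding[] {
  if (output.trim() === "NO FINDINGS") return []; // specialist found nothing
  const findings: Finding[] = [];
  for (const line of output.split("\n")) {
    try {
      const obj = JSON.parse(line);
      findings.push({ ...obj, specialist }); // tag with the specialist name
    } catch {
      // Skip lines that are not valid JSON (preamble, stray prose, blanks)
    }
  }
  return findings;
}
```

Anything the subagent emits outside the JSON-per-line contract is silently dropped rather than failing the whole merge.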

When uncertain whether a change is a regression, err on the side of writing the test.
**Fingerprint and deduplicate:**
For each finding, compute its fingerprint:
- If `fingerprint` field is present, use it
- Otherwise: `{path}:{line}:{category}` (if line is present) or `{path}:{category}`

Format: commit as `test: regression test for {what broke}`
Group findings by fingerprint. For findings sharing the same fingerprint:
- Keep the finding with the highest confidence score
- Tag it: "MULTI-SPECIALIST CONFIRMED ({specialist1} + {specialist2})"
- Boost confidence by +1 (cap at 10)
- Note the confirming specialists in the output
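The fingerprint-and-boost rules above can be sketched as follows (a self-contained illustration; the names are not gstack's actual API):

```typescript
interface Finding {
  confidence: number;
  path: string;
  category: string;
  specialist: string;
  line?: number;
  fingerprint?: string;
  confirmedBy?: string[];
}

// Explicit fingerprint wins; otherwise fall back to path:line:category or path:category.
function fingerprintOf(f: Finding): string {
  if (f.fingerprint) return f.fingerprint;
  return f.line != null ? `${f.path}:${f.line}:${f.category}` : `${f.path}:${f.category}`;
}

function dedupe(findings: Finding[]): Finding[] {
  const groups = new Map<string, Finding[]>();
  for (const f of findings) {
    const key = fingerprintOf(f);
    groups.set(key, [...(groups.get(key) ?? []), f]);
  }
  const merged: Finding[] = [];
  for (const group of groups.values()) {
    // Keep the highest-confidence finding in each fingerprint group.
    const best = group.reduce((a, b) => (b.confidence > a.confidence ? b : a));
    if (group.length > 1) {
      merged.push({
        ...best,
        confidence: Math.min(10, best.confidence + 1), // consensus boost, capped at 10
        confirmedBy: group.map((f) => f.specialist),   // for the MULTI-SPECIALIST CONFIRMED tag
      });
    } else {
      merged.push(best);
    }
  }
  return merged;
}
```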

**Step 4. Output ASCII coverage diagram:**
**Apply confidence gates:**
- Confidence 7+: show normally in the findings output
- Confidence 5-6: show with caveat "Medium confidence — verify this is actually an issue"
- Confidence 3-4: move to appendix (suppress from main findings)
- Confidence 1-2: suppress entirely
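The four gates reduce to a simple threshold ladder (the `Disposition` labels are illustrative):

```typescript
type Disposition = "show" | "caveat" | "appendix" | "suppress";

function gate(confidence: number): Disposition {
  if (confidence >= 7) return "show";     // normal findings output
  if (confidence >= 5) return "caveat";   // shown with a "verify this" note
  if (confidence >= 3) return "appendix"; // suppressed from main findings
  return "suppress";                      // dropped entirely
}
```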

Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:
**Compute PR Quality Score:**
After merging, compute the quality score:
`quality_score = max(0, 10 - (critical_count * 2 + informational_count * 0.5))`
Cap at 10. Log this in the review result at the end.

```
CODE PATH COVERAGE
===========================
[+] src/services/billing.ts
    ├── processPayment()
    │   ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
    │   ├── [GAP]         Network timeout — NO TEST
    │   └── [GAP]         Invalid currency — NO TEST
    └── refundPayment()
        ├── [★★  TESTED] Full refund — billing.test.ts:89
        └── [★   TESTED] Partial refund (checks non-throw only) — billing.test.ts:101

USER FLOW COVERAGE
===========================
[+] Payment checkout flow
    ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
    ├── [GAP] [→E2E] Double-click submit — needs E2E, not just unit
    ├── [GAP]         Navigate away during payment — unit test sufficient
    └── [★   TESTED]  Form validation errors (checks render only) — checkout.test.ts:40

[+] Error states
    ├── [★★  TESTED] Card declined message — billing.test.ts:58
    ├── [GAP]         Network timeout UX (what does user see?) — NO TEST
    └── [GAP]         Empty cart submission — NO TEST

[+] LLM integration
    └── [GAP] [→EVAL] Prompt template change — needs eval test
─────────────────────────────────
COVERAGE: 5/13 paths tested (38%)
  Code paths: 3/5 (60%)
  User flows: 2/8 (25%)
QUALITY:  ★★★: 2  ★★: 2  ★: 1
GAPS: 8 paths need tests (2 need E2E, 1 needs eval)
─────────────────────────────────
```

**Fast path:** All paths covered → "Step 4.75: All new code paths have test coverage ✓" Continue.

**Step 5. Generate tests for gaps (Fix-First):**

If test framework is detected and gaps were identified:
- Classify each gap as AUTO-FIX or ASK per the Fix-First Heuristic:
  - **AUTO-FIX:** Simple unit tests for pure functions, edge cases of existing tested functions
  - **ASK:** E2E tests, tests requiring new test infrastructure, tests for ambiguous behavior
- For AUTO-FIX gaps: generate the test, run it, commit as `test: coverage for {feature}`
- For ASK gaps: include in the Fix-First batch question with the other review findings
- For paths marked [→E2E]: always ASK (E2E tests are higher-effort and need user confirmation)
- For paths marked [→EVAL]: always ASK (eval tests need user confirmation on quality criteria)

**Output merged findings:**
Present the merged findings in the same format as the current review:

```
SPECIALIST REVIEW: N findings (X critical, Y informational) from Z specialists

[For each finding, in order: CRITICAL first, then INFORMATIONAL, sorted by confidence descending]
[SEVERITY] (confidence: N/10, specialist: name) path:line — summary
  Fix: recommended fix
  [If MULTI-SPECIALIST CONFIRMED: show confirmation note]

PR Quality Score: X/10
```

If no test framework detected → include gaps as INFORMATIONAL findings only, no generation.

**Diff is test-only changes:** Skip Step 4.75 entirely: "No new application code paths to audit."

These findings flow into Step 5 Fix-First alongside the CRITICAL pass findings from Step 4.
The Fix-First heuristic applies identically — specialist findings follow the same AUTO-FIX vs ASK classification.
---

### Coverage Warning

After producing the coverage diagram, check the coverage percentage. Read CLAUDE.md for a `## Test Coverage` section with a `Minimum:` field. If not found, use default: 60%.

If coverage is below the minimum threshold, output a prominent warning **before** the regular review findings:

```
⚠️ COVERAGE WARNING: AI-assessed coverage is {X}%. {N} code paths untested.
Consider writing tests before running /ship.
```

This is INFORMATIONAL — does not block /review. But it makes low coverage visible early so the developer can address it before reaching the /ship coverage gate.

If coverage percentage cannot be determined, skip the warning silently.

This step subsumes the "Test Gaps" category from Pass 2 — do not duplicate findings between the checklist Test Gaps item and this coverage diagram. Include any coverage gaps alongside the findings from Step 4 and Step 4.5. They follow the same Fix-First flow — gaps are INFORMATIONAL findings.

---

### Red Team dispatch (conditional)

**Activation:** Only if DIFF_LINES > 200 OR any specialist produced a CRITICAL finding.

If activated, dispatch one more subagent via the Agent tool (foreground, not background).

The Red Team subagent receives:
1. The red-team checklist from `~/.claude/skills/gstack/review/specialists/red-team.md`
2. The merged specialist findings from Step 4.6 (so it knows what was already caught)
3. The git diff command

Prompt: "You are a red team reviewer. The code has already been reviewed by N specialists
who found the following issues: {merged findings summary}. Your job is to find what they
MISSED. Read the checklist, run `git diff origin/<base>`, and look for gaps.
Output findings as JSON objects (same schema as the specialists). Focus on cross-cutting
concerns, integration boundary issues, and failure modes that specialist checklists
don't cover."

If the Red Team finds additional issues, merge them into the findings list before
Step 5 Fix-First. Red Team findings are tagged with `"specialist":"red-team"`.

If the Red Team returns NO FINDINGS, note: "Red Team review: no additional issues found."
If the Red Team subagent fails or times out, skip silently and continue.
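The merge-and-score flow described above can be sketched in a few lines of TypeScript. This is an illustrative model, not the actual resolver code — the `Finding` type mirrors the specialist JSON schema, and all names are assumptions:

```typescript
// Illustrative sketch of the Review Army merge pipeline (not gstack source).
type Finding = {
  severity: "CRITICAL" | "INFORMATIONAL";
  confidence: number;
  path: string;
  line?: number;
  category: string;
  summary: string;
  specialist: string;
  fingerprint?: string;
};

// Fingerprint fallback: {path}:{line}:{category} or {path}:{category}.
function fingerprintOf(f: Finding): string {
  if (f.fingerprint) return f.fingerprint;
  return f.line != null ? `${f.path}:${f.line}:${f.category}` : `${f.path}:${f.category}`;
}

// Dedup: keep the highest-confidence finding per fingerprint; if several
// specialists reported it, tag it and boost confidence by +1 (capped at 10).
function mergeFindings(findings: Finding[]): Finding[] {
  const groups = new Map<string, Finding[]>();
  for (const f of findings) {
    const key = fingerprintOf(f);
    groups.set(key, [...(groups.get(key) ?? []), f]);
  }
  const merged: Finding[] = [];
  for (const group of groups.values()) {
    const best = group.reduce((a, b) => (b.confidence > a.confidence ? b : a));
    const confirmed = group.length > 1;
    merged.push({
      ...best,
      confidence: confirmed ? Math.min(10, best.confidence + 1) : best.confidence,
      summary: confirmed
        ? `MULTI-SPECIALIST CONFIRMED (${group.map((g) => g.specialist).join(" + ")}): ${best.summary}`
        : best.summary,
    });
  }
  return merged;
}

// quality_score = max(0, 10 - (critical*2 + informational*0.5)), capped at 10.
function qualityScore(findings: Finding[]): number {
  const critical = findings.filter((f) => f.severity === "CRITICAL").length;
  const informational = findings.length - critical;
  return Math.min(10, Math.max(0, 10 - (critical * 2 + informational * 0.5)));
}
```

Confidence gating (7+, 5-6, 3-4, 1-2) would then filter `mergeFindings`'s output before display.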

---


M review/SKILL.md.tmpl => review/SKILL.md.tmpl +5 -17
@@ 73,12 73,12 @@ Run `git diff origin/<base>` to get the full diff. This includes both committed 

{{LEARNINGS_SEARCH}}

## Step 4: Two-pass review
## Step 4: Critical pass (core review)

Apply the checklist against the diff in two passes:
Apply the CRITICAL categories from the checklist against the diff:
SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Shell Injection, Enum & Value Completeness.

1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend, Performance & Bundle Impact
Also apply the remaining INFORMATIONAL categories that are still in the checklist (Async/Sync Mixing, Column/Field Name Safety, LLM Prompt Issues, Type Coercion, View/Frontend, Time Window Safety, Completeness Gaps, Distribution & CI/CD).

**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.
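In TypeScript codebases the same class of bug can also be surfaced at compile time with an exhaustiveness guard — a minimal sketch with illustrative names, not gstack code:

```typescript
// "pending" is the newly added value; without a case for it, the `never`
// assignment below fails to type-check, flagging the unhandled call site.
type Status = "active" | "archived" | "pending";

function statusLabel(s: Status): string {
  switch (s) {
    case "active": return "Active";
    case "archived": return "Archived";
    case "pending": return "Pending";
    default: {
      const unhandled: never = s; // exhaustiveness guard
      throw new Error(`unhandled status: ${unhandled}`);
    }
  }
}
```

The grep-then-read review step is still needed for languages and call sites where the compiler can't enforce this.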



@@ 95,19 95,7 @@ Follow the output format specified in the checklist. Respect the suppressions 

---

## Step 4.5: Design Review (conditional)

{{DESIGN_REVIEW_LITE}}

Include any design findings alongside the findings from Step 4. They follow the same Fix-First flow in Step 5 — AUTO-FIX for mechanical CSS fixes, ASK for everything else.

---

## Step 4.75: Test Coverage Diagram

{{TEST_COVERAGE_AUDIT_REVIEW}}

This step subsumes the "Test Gaps" category from Pass 2 — do not duplicate findings between the checklist Test Gaps item and this coverage diagram. Include any coverage gaps alongside the findings from Step 4 and Step 4.5. They follow the same Fix-First flow — gaps are INFORMATIONAL findings.
{{REVIEW_ARMY}}

---


M review/checklist.md => review/checklist.md +12 -52
@@ 5,8 5,9 @@
Review the `git diff origin/main` output for the issues listed below. Be specific — cite `file:line` and suggest fixes. Skip anything that's fine. Only flag real problems.

**Two-pass review:**
- **Pass 1 (CRITICAL):** Run SQL & Data Safety and LLM Output Trust Boundary first. Highest severity.
- **Pass 2 (INFORMATIONAL):** Run all remaining categories. Lower severity but still actioned.
- **Pass 1 (CRITICAL):** Run SQL & Data Safety, Race Conditions, LLM Output Trust Boundary, Shell Injection, and Enum Completeness first. Highest severity.
- **Pass 2 (INFORMATIONAL):** Run remaining categories below. Lower severity but still actioned.
- **Specialist categories (handled by parallel subagents, NOT this checklist):** Test Gaps, Dead Code, Magic Numbers, Conditional Side Effects, Performance & Bundle Impact, Crypto & Entropy. See `review/specialists/` for these.

All findings get action via Fix-First Review: obvious mechanical fixes are applied automatically,
genuinely ambiguous issues are batched into a single user question.


@@ 76,42 77,21 @@ To do this: use Grep to find all references to the sibling values (e.g., grep fo
- Check `.get()` calls on query results use the column name that was actually selected
- Cross-reference with schema documentation when available

#### Conditional Side Effects
- Code paths that branch on a condition but forget to apply a side effect on one branch. Example: item promoted to verified but URL only attached when a secondary condition is true — the other branch promotes without the URL, creating an inconsistent record.
- Log messages that claim an action happened but the action was conditionally skipped. The log should reflect what actually occurred.

#### Magic Numbers & String Coupling
- Bare numeric literals used in multiple files — should be named constants documented together
- Error message strings used as query filters elsewhere (grep for the string — is anything matching on it?)

#### Dead Code & Consistency
- Variables assigned but never read
#### Dead Code & Consistency (version/changelog only — other items handled by maintainability specialist)
- Version mismatch between PR title and VERSION/CHANGELOG files
- CHANGELOG entries that describe changes inaccurately (e.g., "changed from X to Y" when X never existed)
- Comments/docstrings that describe old behavior after the code changed

#### LLM Prompt Issues
- 0-indexed lists in prompts (LLMs reliably return 1-indexed)
- Prompt text listing available tools/capabilities that don't match what's actually wired up in the `tool_classes`/`tools` array
- Word/token limits stated in multiple places that could drift

#### Test Gaps
- Negative-path tests that assert type/status but not the side effects (URL attached? field populated? callback fired?)
- Assertions on string content without checking format (e.g., asserting title present but not URL format)
- `.expects(:something).never` missing when a code path should explicitly NOT call an external service
- Security enforcement features (blocking, rate limiting, auth) without integration tests verifying the enforcement path works end-to-end

#### Completeness Gaps
- Shortcut implementations where the complete version would cost <30 minutes CC time (e.g., partial enum handling, incomplete error paths, missing edge cases that are straightforward to add)
- Options presented with only human-team effort estimates — should show both human and CC+gstack time
- Test coverage gaps where adding the missing tests is a "lake" not an "ocean" (e.g., missing negative-path tests, missing edge case tests that mirror happy-path structure)
- Features implemented at 80-90% when 100% is achievable with modest additional code

#### Crypto & Entropy
- Truncation of data instead of hashing (last N chars instead of SHA-256) — less entropy, easier collisions
- `rand()` / `Random.rand` for security-sensitive values — use `SecureRandom` instead
- Non-constant-time comparisons (`==`) on secrets or tokens — vulnerable to timing attacks

#### Time Window Safety
- Date-key lookups that assume "today" covers 24h — report at 8am PT only sees midnight→8am under today's key
- Mismatched time windows between related features — one uses hourly buckets, another uses daily keys for the same data
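The first bullet can be made concrete: a rolling window must enumerate every date key it touches, not just today's. A minimal sketch using UTC keys (illustrative, not gstack code):

```typescript
// A rolling "last N hours" window can span more than one YYYY-MM-DD key.
// Querying only today's key at 8am silently drops yesterday's 8am–midnight data.
function dateKeysForWindow(now: Date, hours: number): string[] {
  const keys = new Set<string>();
  for (let h = 0; h <= hours; h++) {
    const d = new Date(now.getTime() - h * 3_600_000);
    keys.add(d.toISOString().slice(0, 10)); // UTC date key
  }
  return [...keys].sort();
}
```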


@@ 125,23 105,6 @@ To do this: use Grep to find all references to the sibling values (e.g., grep fo
- O(n*m) lookups in views (`Array#find` in a loop instead of `index_by` hash)
- Ruby-side `.select{}` filtering on DB results that could be a `WHERE` clause (unless intentionally avoiding leading-wildcard `LIKE`)

#### Performance & Bundle Impact
- New `dependencies` entries in package.json that are known-heavy: moment.js (→ date-fns, 330KB→22KB), lodash full (→ lodash-es or per-function imports), jquery, core-js full polyfill
- Significant lockfile growth (many new transitive dependencies from a single addition)
- Images added without `loading="lazy"` or explicit width/height attributes (causes layout shift / CLS)
- Large static assets committed to repo (>500KB per file)
- Synchronous `<script>` tags without async/defer
- CSS `@import` in stylesheets (blocks parallel loading — use bundler imports instead)
- `useEffect` with fetch that depends on another fetch result (request waterfall — combine or parallelize)
- Named → default import switches on tree-shakeable libraries (breaks tree-shaking)
- New `require()` calls in ESM codebases

**DO NOT flag:**
- devDependencies additions (don't affect production bundle)
- Dynamic `import()` calls (code splitting — these are good)
- Small utility additions (<5KB gzipped)
- Server-side-only dependencies

#### Distribution & CI/CD Pipeline
- CI/CD workflow changes (`.github/workflows/`): verify build tool versions match project requirements, artifact names/paths are correct, secrets use `${{ secrets.X }}` not hardcoded values
- New artifact types (CLI binary, library, package): verify a publish/release workflow exists and targets correct platforms


@@ 159,18 122,15 @@ To do this: use Grep to find all references to the sibling values (e.g., grep fo
## Severity Classification

```
CRITICAL (highest severity):      INFORMATIONAL (lower severity):
├─ SQL & Data Safety              ├─ Conditional Side Effects
├─ Race Conditions & Concurrency  ├─ Magic Numbers & String Coupling
├─ LLM Output Trust Boundary      ├─ Dead Code & Consistency
└─ Enum & Value Completeness      ├─ LLM Prompt Issues
                                   ├─ Test Gaps
                                   ├─ Completeness Gaps
                                   ├─ Crypto & Entropy
                                   ├─ Time Window Safety
                                   ├─ Type Coercion at Boundaries
CRITICAL (highest severity):      INFORMATIONAL (main agent):      SPECIALIST (parallel subagents):
├─ SQL & Data Safety              ├─ Async/Sync Mixing             ├─ Testing specialist
├─ Race Conditions & Concurrency  ├─ Column/Field Name Safety      ├─ Maintainability specialist
├─ LLM Output Trust Boundary      ├─ Dead Code (version only)      ├─ Security specialist
├─ Shell Injection                ├─ LLM Prompt Issues             ├─ Performance specialist
└─ Enum & Value Completeness      ├─ Completeness Gaps             ├─ Data Migration specialist
                                   ├─ Time Window Safety            ├─ API Contract specialist
                                   ├─ Type Coercion at Boundaries   └─ Red Team (conditional)
                                   ├─ View/Frontend
                                   ├─ Performance & Bundle Impact
                                   └─ Distribution & CI/CD Pipeline
```

All findings are actioned via Fix-First Review. Severity determines

A review/specialists/api-contract.md => review/specialists/api-contract.md +48 -0
@@ 0,0 1,48 @@
# API Contract Specialist Review Checklist

Scope: When SCOPE_API=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"api-contract","summary":"...","fix":"...","fingerprint":"path:line:api-contract","specialist":"api-contract"}
If no findings: output `NO FINDINGS` and nothing else.

---

## Categories

### Breaking Changes
- Removed fields from response bodies (clients may depend on them)
- Changed field types (string → number, object → array)
- New required parameters added to existing endpoints
- Changed HTTP methods (GET → POST) or status codes (200 → 201)
- Renamed endpoints without maintaining the old path as a redirect/alias
- Changed authentication requirements (public → authenticated)

### Versioning Strategy
- Breaking changes made without a version bump (v1 → v2)
- Multiple versioning strategies mixed in the same API (URL vs header vs query param)
- Deprecated endpoints without a sunset timeline or migration guide
- Version-specific logic scattered across controllers instead of centralized

### Error Response Consistency
- New endpoints returning different error formats than existing ones
- Error responses missing standard fields (error code, message, details)
- HTTP status codes that don't match the error type (200 for errors, 500 for validation)
- Error messages that leak internal implementation details (stack traces, SQL)
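Several of these items reduce to routing every failure through one envelope. A minimal sketch — the field names are illustrative, not a gstack contract:

```typescript
// One error shape for every endpoint, so new endpoints can't drift.
type ApiError = {
  code: string;                      // machine-readable, stable across versions
  message: string;                   // human-readable; no stack traces, no SQL
  details?: Record<string, unknown>; // optional field-level validation info
};

function errorResponse(status: number, error: ApiError) {
  // Status must match the error type (validation → 422, auth → 401/403);
  // never 200 for errors.
  if (status < 400 || status > 599) throw new Error("error responses must use a 4xx/5xx status");
  return { status, body: { error } };
}
```

The review check then reduces to: does every new endpoint route its failures through the shared helper?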

### Rate Limiting & Pagination
- New endpoints missing rate limiting when similar endpoints have it
- Pagination changes (offset → cursor) without backwards compatibility
- Changed page sizes or default limits without documentation
- Missing total count or next-page indicators in paginated responses

### Documentation Drift
- OpenAPI/Swagger spec not updated to match new endpoints or changed params
- README or API docs describing old behavior after changes
- Example requests/responses that no longer work
- Missing documentation for new endpoints or changed parameters

### Backwards Compatibility
- Clients on older versions: will they break?
- Mobile apps that can't force-update: does the API still work for them?
- Webhook payloads changed without notifying subscribers
- SDK or client library changes needed to use new features

A review/specialists/data-migration.md => review/specialists/data-migration.md +47 -0
@@ 0,0 1,47 @@
# Data Migration Specialist Review Checklist

Scope: When SCOPE_MIGRATIONS=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"data-migration","summary":"...","fix":"...","fingerprint":"path:line:data-migration","specialist":"data-migration"}
If no findings: output `NO FINDINGS` and nothing else.

---

## Categories

### Reversibility
- Can this migration be rolled back without data loss?
- Is there a corresponding down/rollback migration?
- Does the rollback actually undo the change or just no-op?
- Would rolling back break the current application code?

### Data Loss Risk
- Dropping columns that still contain data (add deprecation period first)
- Changing column types that truncate data (varchar(255) → varchar(50))
- Removing tables without verifying no code references them
- Renaming columns without updating all references (ORM, raw SQL, views)
- NOT NULL constraints added to columns with existing NULL values (needs backfill first)

### Lock Duration
- ALTER TABLE on large tables without CONCURRENTLY (PostgreSQL)
- Adding indexes without CONCURRENTLY on tables with >100K rows
- Multiple ALTER TABLE statements that could be combined into one lock acquisition
- Schema changes that acquire exclusive locks during peak traffic hours

### Backfill Strategy
- New NOT NULL columns without DEFAULT value (requires backfill before constraint)
- New columns with computed defaults that need batch population
- Missing backfill script or rake task for existing records
- Backfill that updates all rows at once instead of batching (locks table)
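The batching point can be sketched as a range planner (illustrative helper, not a gstack utility):

```typescript
// Plan inclusive [startId, endId] windows so a backfill runs as many small
// UPDATEs instead of one table-locking statement.
function backfillBatches(maxId: number, batchSize: number): Array<[number, number]> {
  const batches: Array<[number, number]> = [];
  for (let start = 1; start <= maxId; start += batchSize) {
    batches.push([start, Math.min(start + batchSize - 1, maxId)]);
  }
  return batches;
}

// Each window then becomes its own short transaction, e.g.
//   UPDATE users SET flags = 0 WHERE id BETWEEN $1 AND $2 AND flags IS NULL;
// and only after the backfill completes is the NOT NULL constraint added.
```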

### Index Creation
- CREATE INDEX without CONCURRENTLY on production tables
- Duplicate indexes (new index covers same columns as existing one)
- Missing indexes on new foreign key columns
- Partial indexes where a full index would be more useful (or vice versa)

### Multi-Phase Safety
- Migrations that must be deployed in a specific order with application code
- Schema changes that break the current running code (deploy code first, then migrate)
- Migrations that assume a deploy boundary (old code + new schema = crash)
- Missing feature flag to handle mixed old/new code during rolling deploy

A review/specialists/maintainability.md => review/specialists/maintainability.md +45 -0
@@ 0,0 1,45 @@
# Maintainability Specialist Review Checklist

Scope: Always-on (every review)
Output: JSON objects, one finding per line. Schema:
{"severity":"INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"maintainability","summary":"...","fix":"...","fingerprint":"path:line:maintainability","specialist":"maintainability"}
If no findings: output `NO FINDINGS` and nothing else.

---

## Categories

### Dead Code & Unused Imports
- Variables assigned but never read in the changed files
- Functions/methods defined but never called (check with Grep across the repo)
- Imports/requires that are no longer referenced after the change
- Commented-out code blocks (either remove or explain why they exist)

### Magic Numbers & String Coupling
- Bare numeric literals used in logic (thresholds, limits, retry counts) — should be named constants
- Error message strings used as query filters or conditionals elsewhere
- Hardcoded URLs, ports, or hostnames that should be config
- Duplicated literal values across multiple files

### Stale Comments & Docstrings
- Comments that describe old behavior after the code was changed in this diff
- TODO/FIXME comments that reference completed work
- Docstrings with parameter lists that don't match the current function signature
- ASCII diagrams in comments that no longer match the code flow

### DRY Violations
- Similar code blocks (3+ lines) appearing multiple times within the diff
- Copy-paste patterns where a shared helper would be cleaner
- Configuration or setup logic duplicated across test files
- Repeated conditional chains that could be a lookup table or map

### Conditional Side Effects
- Code paths that branch on a condition but forget a side effect on one branch
- Log messages that claim an action happened but the action was conditionally skipped
- State transitions where one branch updates related records but the other doesn't
- Event emissions that only fire on the happy path, missing error/edge paths

### Module Boundary Violations
- Reaching into another module's internal implementation (accessing private-by-convention methods)
- Direct database queries in controllers/views that should go through a service/model
- Tight coupling between components that should communicate through interfaces

A review/specialists/performance.md => review/specialists/performance.md +51 -0
@@ 0,0 1,51 @@
# Performance Specialist Review Checklist

Scope: When SCOPE_BACKEND=true OR SCOPE_FRONTEND=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"performance","summary":"...","fix":"...","fingerprint":"path:line:performance","specialist":"performance"}
If no findings: output `NO FINDINGS` and nothing else.

---

## Categories

### N+1 Queries
- ActiveRecord/ORM associations traversed in loops without eager loading (.includes, joinedload, include)
- Database queries inside iteration blocks (each, map, forEach) that could be batched
- Nested serializers that trigger lazy-loaded associations
- GraphQL resolvers that query per-field instead of batching (check for DataLoader usage)

### Missing Database Indexes
- New WHERE clauses on columns without indexes (check migration files or schema)
- New ORDER BY on non-indexed columns
- Composite queries (WHERE a AND b) without composite indexes
- Foreign key columns added without indexes

### Algorithmic Complexity
- O(n^2) or worse patterns: nested loops over collections, Array.find inside Array.map
- Repeated linear searches that could use a hash/map/set lookup
- String concatenation in loops (use join or StringBuilder)
- Sorting or filtering large collections multiple times when once would suffice
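The "repeated linear searches" item can be shown in miniature — same output, different complexity (illustrative types):

```typescript
type User = { id: number; name: string };
type Order = { userId: number; total: number };

// Before: O(n*m) — a linear search of `users` for every order.
function attachNamesSlow(orders: Order[], users: User[]) {
  return orders.map((o) => ({ ...o, name: users.find((u) => u.id === o.userId)?.name }));
}

// After: O(n+m) — build the index once, then constant-time lookups.
function attachNamesFast(orders: Order[], users: User[]) {
  const byId = new Map(users.map((u) => [u.id, u.name] as const));
  return orders.map((o) => ({ ...o, name: byId.get(o.userId) }));
}
```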

### Bundle Size Impact (Frontend)
- New production dependencies that are known-heavy (moment.js, lodash full, jquery)
- Barrel imports (import from 'library') instead of deep imports (import from 'library/specific')
- Large static assets (images, fonts) committed without optimization
- Missing code splitting for route-level chunks

### Rendering Performance (Frontend)
- Fetch waterfalls: sequential API calls that could be parallel (Promise.all)
- Unnecessary re-renders from unstable references (new objects/arrays in render)
- Missing React.memo, useMemo, or useCallback on expensive computations
- Layout thrashing from reading then writing DOM properties in loops
- Missing loading="lazy" on below-fold images

### Missing Pagination
- List endpoints that return unbounded results (no LIMIT, no pagination params)
- Database queries without LIMIT that grow with data volume
- API responses that embed full nested objects instead of IDs with expansion

### Blocking in Async Contexts
- Synchronous I/O (file reads, subprocess, HTTP requests) inside async functions
- time.sleep() / Thread.sleep() inside event-loop-based handlers
- CPU-intensive computation blocking the main thread without worker offload

A review/specialists/red-team.md => review/specialists/red-team.md +44 -0
@@ 0,0 1,44 @@
# Red Team Review

Scope: When diff > 200 lines OR security specialist found CRITICAL findings. Runs AFTER other specialists.
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"red-team","summary":"...","fix":"...","fingerprint":"path:line:red-team","specialist":"red-team"}
If no findings: output `NO FINDINGS` and nothing else.

---

This is NOT a checklist review. This is adversarial analysis.

You have access to the other specialists' findings (provided in your prompt). Your job is to find what they MISSED. Think like an attacker, a chaos engineer, and a hostile QA tester simultaneously.

## Approach

### 1. Attack the Happy Path
- What happens when the system is under 10x normal load?
- What happens when two requests hit the same resource simultaneously?
- What happens when the database is slow (>5s query time)?
- What happens when an external service returns garbage?

### 2. Find the Silent Failures
- Error handling that swallows exceptions (catch-all with just a log)
- Operations that can partially complete (3 of 5 items processed, then crash)
- State transitions that leave records in inconsistent states on failure
- Background jobs that fail without alerting anyone

### 3. Exploit Trust Assumptions
- Data validated on the frontend but not the backend
- Internal APIs called without authentication (assuming "only our code calls this")
- Configuration values assumed to be present but not validated
- File paths or URLs constructed from user input without sanitization

### 4. Break the Edge Cases
- What happens with the maximum possible input size?
- What happens with zero items, empty strings, null values?
- What happens on the first run ever (no existing data)?
- What happens when the user clicks the button twice in 100ms?
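One concrete probe for the double-click case: does the code collapse duplicate in-flight submissions? A minimal sketch of such a guard (illustrative, not gstack code):

```typescript
// A second call with the same key while the first is still in flight reuses
// the first call's promise instead of firing a second request.
const inFlight = new Map<string, Promise<unknown>>();

function submitOnce<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```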

### 5. Find What the Other Specialists Missed
- Review each specialist's findings. What's the gap between their categories?
- Look for cross-category issues (e.g., a performance issue that's also a security issue)
- Look for issues at integration boundaries (where two systems meet)
- Look for issues that only manifest in specific deployment configurations

A review/specialists/security.md => review/specialists/security.md +60 -0
@@ 0,0 1,60 @@
# Security Specialist Review Checklist

Scope: When SCOPE_AUTH=true OR (SCOPE_BACKEND=true AND diff > 100 lines)
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"security","summary":"...","fix":"...","fingerprint":"path:line:security","specialist":"security"}
If no findings: output `NO FINDINGS` and nothing else.

---

This checklist goes deeper than the main CRITICAL pass. The main agent already checks SQL injection, race conditions, LLM trust, and enum completeness. This specialist focuses on auth/authz patterns, cryptographic misuse, and attack surface expansion.

## Categories

### Input Validation at Trust Boundaries
- User input accepted without validation at controller/handler level
- Query parameters used directly in database queries or file paths
- Request body fields accepted without type checking or schema validation
- File uploads without type/size/content validation
- Webhook payloads processed without signature verification

### Auth & Authorization Bypass
- Endpoints missing authentication middleware (check route definitions)
- Authorization checks that default to "allow" instead of "deny"
- Role escalation paths (user can modify their own role/permissions)
- Direct object reference vulnerabilities (user A accesses user B's data by changing an ID)
- Session fixation or session hijacking opportunities
- Token/API key validation that doesn't check expiration

### Injection Vectors (beyond SQL)
- Command injection via subprocess calls with user-controlled arguments
- Template injection (Jinja2, ERB, Handlebars) with user input
- LDAP injection in directory queries
- SSRF via user-controlled URLs (fetch, redirect, webhook targets)
- Path traversal via user-controlled file paths (../../etc/passwd)
- Header injection via user-controlled values in HTTP headers

### Cryptographic Misuse
- Weak hashing algorithms (MD5, SHA1) for security-sensitive operations
- Predictable randomness (Math.random, rand()) for tokens or secrets
- Non-constant-time comparisons (==) on secrets, tokens, or digests
- Hardcoded encryption keys or IVs
- Missing salt in password hashing
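The non-constant-time comparison item has a standard Node.js fix — hash both sides, then compare with `crypto.timingSafeEqual` (a sketch assuming a Node runtime):

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time secret comparison: hashing both sides first gives
// timingSafeEqual equal-length buffers, and the early-exit-on-first-
// differing-byte behavior of `===` is avoided entirely.
function secureCompare(expected: string, provided: string): boolean {
  const a = createHash("sha256").update(expected).digest();
  const b = createHash("sha256").update(provided).digest();
  return timingSafeEqual(a, b);
}
```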

### Secrets Exposure
- API keys, tokens, or passwords in source code (even in comments)
- Secrets logged in application logs or error messages
- Credentials in URLs (query parameters or basic auth in URL)
- Sensitive data in error responses returned to users
- PII stored in plaintext when encryption is expected

### XSS via Escape Hatches
- Rails: .html_safe, raw() on user-controlled data
- React: dangerouslySetInnerHTML with user content
- Vue: v-html with user content
- Django: |safe, mark_safe() on user input
- General: innerHTML assignment with unsanitized data

### Deserialization
- Deserializing untrusted data (pickle, Marshal, YAML.load, JSON.parse of executable types)
- Accepting serialized objects from user input or external APIs without schema validation

A review/specialists/testing.md => review/specialists/testing.md +45 -0
@@ 0,0 1,45 @@
# Testing Specialist Review Checklist

Scope: Always-on (every review)
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"testing","summary":"...","fix":"...","fingerprint":"path:line:testing","specialist":"testing"}
If no findings: output `NO FINDINGS` and nothing else.

---

## Categories

### Missing Negative-Path Tests
- New code paths that handle errors, rejections, or invalid input with NO corresponding test
- Guard clauses and early returns that are untested
- Error branches in try/catch, rescue, or error boundaries with no failure-path test
- Permission/auth checks that are asserted in code but never tested for the "denied" case

### Missing Edge-Case Coverage
- Boundary values: zero, negative, max-int, empty string, empty array, nil/null/undefined
- Single-element collections (off-by-one on loops)
- Unicode and special characters in user-facing inputs
- Concurrent access patterns with no race-condition test

### Test Isolation Violations
- Tests sharing mutable state (class variables, global singletons, DB records not cleaned up)
- Order-dependent tests (pass in sequence, fail when randomized)
- Tests that depend on system clock, timezone, or locale
- Tests that make real network calls instead of using stubs/mocks

### Flaky Test Patterns
- Timing-dependent assertions (sleep, setTimeout, waitFor with tight timeouts)
- Assertions on ordering of unordered results (hash keys, Set iteration, async resolution order)
- Tests that depend on external services (APIs, databases) without fallback
- Randomized test data without seed control
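The ordering flake has a mechanical fix: compare membership, not order. A minimal helper (illustrative, not gstack code):

```typescript
// Flaky: asserting on the order of results the code never guarantees
// (hash-key iteration, Set order, async resolution order).
// Stable: sort both sides, then compare element-wise.
function assertSameMembers(actual: string[], expected: string[]): void {
  const a = [...actual].sort();
  const e = [...expected].sort();
  if (a.length !== e.length || a.some((v, i) => v !== e[i])) {
    throw new Error(`expected same members, got ${JSON.stringify(actual)}`);
  }
}
```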

### Security Enforcement Tests Missing
- Auth/authz checks in controllers with no test for the "unauthorized" case
- Rate limiting logic with no test proving it actually blocks
- Input sanitization with no test for malicious input
- CSRF/CORS configuration with no integration test

### Coverage Gaps
- New public methods/functions with zero test coverage
- Changed methods where existing tests only cover the old behavior, not the new branch
- Utility functions called from multiple places but tested only indirectly

M scripts/resolvers/index.ts => scripts/resolvers/index.ts +2 -0
@@ 16,6 16,7 @@ import { generateSlugEval, generateSlugSetup, generateBaseBranchDetect, generate
import { generateLearningsSearch, generateLearningsLog } from './learnings';
import { generateConfidenceCalibration } from './confidence';
import { generateInvokeSkill } from './composition';
import { generateReviewArmy } from './review-army';

export const RESOLVERS: Record<string, ResolverFn> = {
  SLUG_EVAL: generateSlugEval,


@@ 57,4 58,5 @@ export const RESOLVERS: Record<string, ResolverFn> = {
  CONFIDENCE_CALIBRATION: generateConfidenceCalibration,
  INVOKE_SKILL: generateInvokeSkill,
  CHANGELOG_WORKFLOW: generateChangelogWorkflow,
  REVIEW_ARMY: generateReviewArmy,
};
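
For readers unfamiliar with the registry pattern used here, a hypothetical sketch of how a `{{PLACEHOLDER}}` could be expanded through `RESOLVERS` follows. The real gen-skill-docs pipeline and `TemplateContext` shape are not shown in this diff; `expand`, `Ctx`, and the template syntax handling are assumptions for illustration only:

```typescript
// Hypothetical illustration of placeholder expansion via a resolver registry.
type Ctx = Record<string, unknown>;
type ResolverFn = (ctx: Ctx) => string;

const RESOLVERS: Record<string, ResolverFn> = {
  // Stand-in for generateReviewArmy; the real resolver emits full prose.
  REVIEW_ARMY: () => '## Step 4.5: Review Army — Specialist Dispatch',
};

// Replace each {{KEY}} with its resolver output; unknown keys pass through.
function expand(template: string, ctx: Ctx): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in RESOLVERS ? RESOLVERS[key](ctx) : match,
  );
}
```

The pass-through behavior for unknown keys matters: a template referencing an unregistered resolver should fail visibly in the generated output rather than silently vanish.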

A scripts/resolvers/review-army.ts => scripts/resolvers/review-army.ts +190 -0
@@ 0,0 1,190 @@
/**
 * Review Army resolver — parallel specialist reviewers for /review
 *
 * Generates template prose that instructs Claude to:
 * 1. Detect stack and scope (via gstack-diff-scope)
 * 2. Select and dispatch specialist subagents in parallel
 * 3. Collect, parse, merge, and deduplicate JSON findings
 * 4. Feed merged findings into the existing Fix-First pipeline
 *
 * Shipped as Release 2 of the self-learning roadmap (SELF_LEARNING_V0.md).
 */
import type { TemplateContext } from './types';

function generateSpecialistSelection(ctx: TemplateContext): string {
  return `## Step 4.5: Review Army — Specialist Dispatch

### Detect stack and scope

\`\`\`bash
source <(${ctx.paths.binDir}/gstack-diff-scope <base> 2>/dev/null) || true
# Detect stack for specialist context
STACK=""
[ -f Gemfile ] && STACK="\${STACK}ruby "
[ -f package.json ] && STACK="\${STACK}node "
{ [ -f requirements.txt ] || [ -f pyproject.toml ]; } && STACK="\${STACK}python "
[ -f go.mod ] && STACK="\${STACK}go "
[ -f Cargo.toml ] && STACK="\${STACK}rust "
echo "STACK: \${STACK:-unknown}"
DIFF_LINES=$(git diff origin/<base> --stat | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
echo "DIFF_LINES: $DIFF_LINES"
\`\`\`

### Select specialists

Based on the scope signals above, select which specialists to dispatch.

**Always-on (dispatch on every review with 50+ changed lines):**
1. **Testing** — read \`${ctx.paths.skillRoot}/review/specialists/testing.md\`
2. **Maintainability** — read \`${ctx.paths.skillRoot}/review/specialists/maintainability.md\`

**If DIFF_LINES < 50:** Skip all specialists. Print: "Small diff ($DIFF_LINES lines) — specialists skipped." Continue to Step 5.

**Conditional (dispatch if the matching scope signal is true):**
3. **Security** — if SCOPE_AUTH=true, OR if SCOPE_BACKEND=true AND DIFF_LINES > 100. Read \`${ctx.paths.skillRoot}/review/specialists/security.md\`
4. **Performance** — if SCOPE_BACKEND=true OR SCOPE_FRONTEND=true. Read \`${ctx.paths.skillRoot}/review/specialists/performance.md\`
5. **Data Migration** — if SCOPE_MIGRATIONS=true. Read \`${ctx.paths.skillRoot}/review/specialists/data-migration.md\`
6. **API Contract** — if SCOPE_API=true. Read \`${ctx.paths.skillRoot}/review/specialists/api-contract.md\`
7. **Design** — if SCOPE_FRONTEND=true. Use the existing design review checklist at \`${ctx.paths.skillRoot}/review/design-checklist.md\`

Note which specialists were selected and which were skipped. Print the selection:
"Dispatching N specialists: [names]. Skipped: [names] (scope not detected)."`;
}

function generateSpecialistDispatch(ctx: TemplateContext): string {
  return `### Dispatch specialists in parallel

For each selected specialist, launch an independent subagent via the Agent tool.
**Launch ALL selected specialists in a single message** (multiple Agent tool calls)
so they run in parallel. Each subagent has fresh context — no prior review bias.

**Each specialist subagent prompt:**

Construct the prompt for each specialist. The prompt includes:

1. The specialist's checklist content (you already read the file above)
2. Stack context: "This is a {STACK} project."
3. Past learnings for this domain (if any exist):

\`\`\`bash
${ctx.paths.binDir}/gstack-learnings-search --type pitfall --query "{specialist domain}" --limit 5 2>/dev/null || true
\`\`\`

If learnings are found, include them: "Past learnings for this domain: {learnings}"

4. Instructions:

"You are a specialist code reviewer. Read the checklist below, then run
\`git diff origin/<base>\` to get the full diff. Apply the checklist against the diff.

For each finding, output a JSON object on its own line:
{\\"severity\\":\\"CRITICAL|INFORMATIONAL\\",\\"confidence\\":N,\\"path\\":\\"file\\",\\"line\\":N,\\"category\\":\\"category\\",\\"summary\\":\\"description\\",\\"fix\\":\\"recommended fix\\",\\"fingerprint\\":\\"path:line:category\\",\\"specialist\\":\\"name\\"}

Required fields: severity, confidence, path, category, summary, specialist.
Optional: line, fix, fingerprint, evidence.

If no findings: output \`NO FINDINGS\` and nothing else.
Do not output anything else — no preamble, no summary, no commentary.

Stack context: {STACK}
Past learnings: {learnings or 'none'}

CHECKLIST:
{checklist content}"

**Subagent configuration:**
- Use \`subagent_type: "general-purpose"\`
- Do NOT use \`run_in_background\` — all specialists must complete before merge
- If any specialist subagent fails or times out, log the failure and continue with results from successful specialists. Specialists are additive — partial results are better than no results.`;
}

function generateFindingsMerge(_ctx: TemplateContext): string {
  return `### Step 4.6: Collect and merge findings

After all specialist subagents complete, collect their outputs.

**Parse findings:**
For each specialist's output:
1. If output is "NO FINDINGS" — skip, this specialist found nothing
2. Otherwise, parse each line as a JSON object. Skip lines that are not valid JSON.
3. Collect all parsed findings into a single list, tagged with their specialist name.

**Fingerprint and deduplicate:**
For each finding, compute its fingerprint:
- If \`fingerprint\` field is present, use it
- Otherwise: \`{path}:{line}:{category}\` (if line is present) or \`{path}:{category}\`

Group findings by fingerprint. For findings sharing the same fingerprint:
- Keep the finding with the highest confidence score
- Tag it: "MULTI-SPECIALIST CONFIRMED ({specialist1} + {specialist2})"
- Boost confidence by +1 (cap at 10)
- Note the confirming specialists in the output

**Apply confidence gates:**
- Confidence 7+: show normally in the findings output
- Confidence 5-6: show with caveat "Medium confidence — verify this is actually an issue"
- Confidence 3-4: move to appendix (suppress from main findings)
- Confidence 1-2: suppress entirely

**Compute PR Quality Score:**
After merging, compute the quality score:
\`quality_score = max(0, 10 - (critical_count * 2 + informational_count * 0.5))\`
The formula already bounds the score to the 0 to 10 range. Log it in the review result at the end.

**Output merged findings:**
Present the merged findings in the same format as the current review:

\`\`\`
SPECIALIST REVIEW: N findings (X critical, Y informational) from Z specialists

[For each finding, in order: CRITICAL first, then INFORMATIONAL, sorted by confidence descending]
[SEVERITY] (confidence: N/10, specialist: name) path:line — summary
  Fix: recommended fix
  [If MULTI-SPECIALIST CONFIRMED: show confirmation note]

PR Quality Score: X/10
\`\`\`

These findings flow into Step 5 Fix-First alongside the CRITICAL pass findings from Step 4.
The Fix-First heuristic applies identically — specialist findings follow the same AUTO-FIX vs ASK classification.`;
}

function generateRedTeam(ctx: TemplateContext): string {
  return `### Red Team dispatch (conditional)

**Activation:** Only if DIFF_LINES > 200 OR any specialist produced a CRITICAL finding.

If activated, dispatch one more subagent via the Agent tool (foreground, not background).

The Red Team subagent receives:
1. The red-team checklist from \`${ctx.paths.skillRoot}/review/specialists/red-team.md\`
2. The merged specialist findings from Step 4.6 (so it knows what was already caught)
3. The git diff command

Prompt: "You are a red team reviewer. The code has already been reviewed by N specialists
who found the following issues: {merged findings summary}. Your job is to find what they
MISSED. Read the checklist, run \`git diff origin/<base>\`, and look for gaps.
Output findings as JSON objects (same schema as the specialists). Focus on cross-cutting
concerns, integration boundary issues, and failure modes that specialist checklists
don't cover."

If the Red Team finds additional issues, merge them into the findings list before
Step 5 Fix-First. Red Team findings are tagged with \`"specialist":"red-team"\`.

If the Red Team returns NO FINDINGS, note: "Red Team review: no additional issues found."
If the Red Team subagent fails or times out, skip silently and continue.`;
}

export function generateReviewArmy(ctx: TemplateContext): string {
  // Codex host: strip entirely — Codex should not run Review Army
  if (ctx.host === 'codex') return '';

  const sections = [
    generateSpecialistSelection(ctx),
    generateSpecialistDispatch(ctx),
    generateFindingsMerge(ctx),
    generateRedTeam(ctx),
  ];

  return sections.join('\n\n---\n\n');
}
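
The merge step that generateFindingsMerge describes in prose can be sketched in code. This is a hypothetical illustration with assumed type and function names; the shipped resolver emits instructions for Claude to follow rather than executing this logic itself:

```typescript
// Sketch of the Step 4.6 merge: fingerprint dedup, consensus boost, score.
interface Finding {
  severity: 'CRITICAL' | 'INFORMATIONAL';
  confidence: number; // 1-10
  path: string;
  category: string;
  specialist: string;
  line?: number;
  fingerprint?: string;
  summary?: string;
}

// Fingerprint fallback: explicit field, else path:line:category, else path:category.
function fingerprintOf(f: Finding): string {
  if (f.fingerprint) return f.fingerprint;
  return f.line != null ? `${f.path}:${f.line}:${f.category}` : `${f.path}:${f.category}`;
}

// Group by fingerprint; keep the highest-confidence finding and boost it +1
// (capped at 10) when more than one specialist reported the same issue.
function mergeFindings(findings: Finding[]): Finding[] {
  const groups = new Map<string, Finding[]>();
  for (const f of findings) {
    const fp = fingerprintOf(f);
    groups.set(fp, [...(groups.get(fp) ?? []), f]);
  }
  const merged: Finding[] = [];
  for (const group of groups.values()) {
    const best = group.reduce((a, b) => (b.confidence > a.confidence ? b : a));
    merged.push(
      group.length > 1 ? { ...best, confidence: Math.min(10, best.confidence + 1) } : best,
    );
  }
  return merged;
}

// quality_score = max(0, 10 - (critical_count * 2 + informational_count * 0.5))
function qualityScore(findings: Finding[]): number {
  const critical = findings.filter((f) => f.severity === 'CRITICAL').length;
  const info = findings.length - critical;
  return Math.max(0, 10 - (critical * 2 + info * 0.5));
}
```

Note that grouping happens before the confidence gate, so a medium-confidence finding confirmed by two specialists can cross the 7+ threshold via the consensus boost.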

M scripts/resolvers/review.ts => scripts/resolvers/review.ts +59 -4
@@ 817,16 817,71 @@ After producing the completion checklist:

**Include in PR body (Step 8):** Add a \`## Plan Completion\` section with the checklist summary.`);
  } else {
    // review mode — enhanced Delivery Integrity (Release 2: Review Army)
    sections.push(`
### Fallback Intent Sources (when no plan file found)

When no plan file is detected, use these secondary intent sources:

1. **Commit messages:** Run \`git log origin/<base>..HEAD --oneline\`. Use judgment to extract real intent:
   - Commits with actionable verbs ("add", "implement", "fix", "create", "remove", "update") are intent signals
   - Skip noise: "WIP", "tmp", "squash", "merge", "chore", "typo", "fixup"
   - Extract the intent behind the commit, not the literal message
2. **TODOS.md:** If it exists, check for items related to this branch or recent dates
3. **PR description:** Run \`gh pr view --json body -q .body 2>/dev/null\` for intent context

**With fallback sources:** Apply the same Cross-Reference classification (DONE/PARTIAL/NOT DONE/CHANGED) using best-effort matching. Note that fallback-sourced items are lower confidence than plan-file items.

### Investigation Depth

For each PARTIAL or NOT DONE item, investigate WHY:

1. Check \`git log origin/<base>..HEAD --oneline\` for commits that suggest the work was started, attempted, or reverted
2. Read the relevant code to understand what was built instead
3. Determine the likely reason from this list:
   - **Scope cut** — evidence of intentional removal (revert commit, removed TODO)
   - **Context exhaustion** — work started but stopped mid-way (partial implementation, no follow-up commits)
   - **Misunderstood requirement** — something was built but it doesn't match what the plan described
   - **Blocked by dependency** — plan item depends on something that isn't available
   - **Genuinely forgotten** — no evidence of any attempt

Output for each discrepancy:
\`\`\`
DISCREPANCY: {PARTIAL|NOT_DONE} | {plan item} | {what was actually delivered}
INVESTIGATION: {likely reason with evidence from git log / code}
IMPACT: {HIGH|MEDIUM|LOW} — {what breaks or degrades if this stays undelivered}
\`\`\`

### Learnings Logging (plan-file discrepancies only)

**Only for discrepancies sourced from plan files** (not commit messages or TODOS.md), log a learning so future sessions know this pattern occurred:

\`\`\`bash
~/.claude/skills/gstack/bin/gstack-learnings-log '{
  "type": "pitfall",
  "key": "plan-delivery-gap-KEBAB_SUMMARY",
  "insight": "Planned X but delivered Y because Z",
  "confidence": 8,
  "source": "observed",
  "files": ["PLAN_FILE_PATH"]
}'
\`\`\`

Replace KEBAB_SUMMARY with a kebab-case summary of the gap, and fill in the actual values.

**Do NOT log learnings from commit-message-derived or TODOS.md-derived discrepancies.** These are informational in the review output but too noisy for durable memory.

### Integration with Scope Drift Detection

The plan completion results augment the existing Scope Drift Detection. If a plan file is found:

- **NOT DONE items** become additional evidence for **MISSING REQUIREMENTS** in the scope drift report.
- **Items in the diff that don't match any plan item** become evidence for **SCOPE CREEP** detection.
- **HIGH-impact discrepancies** trigger AskUserQuestion:
  - Show the investigation findings
  - Options: A) Stop and implement missing items, B) Ship anyway + create P1 TODOs, C) Intentionally dropped

This is **INFORMATIONAL** unless HIGH-impact discrepancies are found (then it gates via AskUserQuestion).

Update the scope drift output to include plan file context:



@@ 836,11 891,11 @@ Intent: <from plan file — 1-line summary>
Plan: <plan file path>
Delivered: <1-line summary of what the diff actually does>
Plan items: N DONE, M PARTIAL, K NOT DONE
[If NOT DONE: list each missing item with investigation]
[If scope creep: list each out-of-scope change not in the plan]
\`\`\`

**No plan file found:** Use commit messages and TODOS.md as fallback sources (see above). If no intent sources at all, skip with: "No intent sources detected — skipping completion audit."`);
  }

  return sections.join('\n');

A test/diff-scope.test.ts => test/diff-scope.test.ts +165 -0
@@ 0,0 1,165 @@
/**
 * Tests for bin/gstack-diff-scope — verifies scope signal detection.
 *
 * Creates temp git repos with specific file patterns and verifies
 * the correct SCOPE_* variables are output.
 */
import { describe, test, expect, afterAll } from 'bun:test';
import { mkdtempSync, writeFileSync, mkdirSync, rmSync } from 'fs';
import { join } from 'path';
import { tmpdir } from 'os';
import { spawnSync } from 'child_process';

const SCRIPT = join(import.meta.dir, '..', 'bin', 'gstack-diff-scope');

const dirs: string[] = [];

function createRepo(files: string[]): string {
  const dir = mkdtempSync(join(tmpdir(), 'diff-scope-test-'));
  dirs.push(dir);

  const run = (cmd: string, args: string[]) =>
    spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });

  run('git', ['init', '-b', 'main']);
  run('git', ['config', 'user.email', 'test@test.com']);
  run('git', ['config', 'user.name', 'Test']);

  // Base commit
  writeFileSync(join(dir, 'README.md'), '# test\n');
  run('git', ['add', '.']);
  run('git', ['commit', '-m', 'initial']);

  // Feature branch with specified files
  run('git', ['checkout', '-b', 'feature/test']);
  for (const f of files) {
    const fullPath = join(dir, f);
    const dirPath = fullPath.substring(0, fullPath.lastIndexOf('/'));
    if (dirPath !== dir) mkdirSync(dirPath, { recursive: true });
    writeFileSync(fullPath, '# test content\n');
  }
  run('git', ['add', '.']);
  run('git', ['commit', '-m', 'add files']);

  return dir;
}

function runScope(dir: string): Record<string, string> {
  const result = spawnSync('bash', [SCRIPT, 'main'], {
    cwd: dir, stdio: 'pipe', timeout: 5000,
  });
  const output = result.stdout.toString().trim();
  const vars: Record<string, string> = {};
  for (const line of output.split('\n')) {
    const [key, val] = line.split('=');
    if (key && val) vars[key] = val;
  }
  return vars;
}

afterAll(() => {
  for (const d of dirs) {
    try { rmSync(d, { recursive: true, force: true }); } catch {}
  }
});

describe('gstack-diff-scope', () => {
  // --- Existing scope signals ---

  test('detects frontend files', () => {
    const dir = createRepo(['styles.css', 'component.tsx']);
    const scope = runScope(dir);
    expect(scope.SCOPE_FRONTEND).toBe('true');
  });

  test('detects backend files', () => {
    const dir = createRepo(['app.rb', 'service.py']);
    const scope = runScope(dir);
    expect(scope.SCOPE_BACKEND).toBe('true');
  });

  test('detects test files', () => {
    const dir = createRepo(['test/app.test.ts']);
    const scope = runScope(dir);
    expect(scope.SCOPE_TESTS).toBe('true');
  });

  // --- New scope signals (Review Army) ---

  test('detects migrations via db/migrate/', () => {
    const dir = createRepo(['db/migrate/20260330_create_users.rb']);
    const scope = runScope(dir);
    expect(scope.SCOPE_MIGRATIONS).toBe('true');
  });

  test('detects migrations via generic migrations/', () => {
    const dir = createRepo(['app/migrations/0001_initial.py']);
    const scope = runScope(dir);
    expect(scope.SCOPE_MIGRATIONS).toBe('true');
  });

  test('detects migrations via prisma', () => {
    const dir = createRepo(['prisma/migrations/20260330/migration.sql']);
    const scope = runScope(dir);
    expect(scope.SCOPE_MIGRATIONS).toBe('true');
  });

  test('detects API via controller files', () => {
    const dir = createRepo(['app/controllers/users_controller.rb']);
    const scope = runScope(dir);
    expect(scope.SCOPE_API).toBe('true');
  });

  test('detects API via route files', () => {
    const dir = createRepo(['src/routes/api.ts']);
    const scope = runScope(dir);
    expect(scope.SCOPE_API).toBe('true');
  });

  test('detects API via GraphQL schemas', () => {
    const dir = createRepo(['schema.graphql']);
    const scope = runScope(dir);
    expect(scope.SCOPE_API).toBe('true');
  });

  test('detects auth files', () => {
    const dir = createRepo(['app/services/auth_service.rb']);
    const scope = runScope(dir);
    expect(scope.SCOPE_AUTH).toBe('true');
  });

  test('detects session files', () => {
    const dir = createRepo(['lib/session_manager.ts']);
    const scope = runScope(dir);
    expect(scope.SCOPE_AUTH).toBe('true');
  });

  test('detects JWT files', () => {
    const dir = createRepo(['utils/jwt_helper.py']);
    const scope = runScope(dir);
    expect(scope.SCOPE_AUTH).toBe('true');
  });

  test('returns false for all new signals when no matching files', () => {
    const dir = createRepo(['docs/readme.md', 'config.yml']);
    const scope = runScope(dir);
    expect(scope.SCOPE_MIGRATIONS).toBe('false');
    expect(scope.SCOPE_API).toBe('false');
    expect(scope.SCOPE_AUTH).toBe('false');
  });

  test('outputs all 9 scope variables', () => {
    const dir = createRepo(['app.ts']);
    const scope = runScope(dir);
    expect(Object.keys(scope)).toHaveLength(9);
    expect(scope).toHaveProperty('SCOPE_FRONTEND');
    expect(scope).toHaveProperty('SCOPE_BACKEND');
    expect(scope).toHaveProperty('SCOPE_PROMPTS');
    expect(scope).toHaveProperty('SCOPE_TESTS');
    expect(scope).toHaveProperty('SCOPE_DOCS');
    expect(scope).toHaveProperty('SCOPE_CONFIG');
    expect(scope).toHaveProperty('SCOPE_MIGRATIONS');
    expect(scope).toHaveProperty('SCOPE_API');
    expect(scope).toHaveProperty('SCOPE_AUTH');
  });
});

A test/fixtures/review-army-migration.sql => test/fixtures/review-army-migration.sql +5 -0
@@ 0,0 1,5 @@
-- Migration: Drop user email column
-- WARNING: This migration is intentionally unsafe for testing
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users DROP COLUMN phone_number;
-- No backfill, no reversibility check, no data preservation

A test/fixtures/review-army-n-plus-one.rb => test/fixtures/review-army-n-plus-one.rb +12 -0
@@ 0,0 1,12 @@
# N+1 query example — intentionally bad for testing
class PostsController
  def index
    @posts = Post.all
    @posts.each do |post|
      # N+1: queries Author table for every post
      puts post.author.name
      # N+1: queries Comments table for every post
      puts post.comments.count
    end
  end
end

M test/gen-skill-docs.test.ts => test/gen-skill-docs.test.ts +55 -21
@@ 607,7 607,8 @@ describe('TEST_COVERAGE_AUDIT placeholders', () => {
  const shipSkill = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
  const reviewSkill = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');

  test('plan and ship modes share codepath tracing methodology', () => {
    // Review mode delegates test coverage to the Testing specialist subagent (Review Army)
    const sharedPhrases = [
      'Trace data flow',
      'Diagram the execution',


@@ 619,33 620,40 @@ describe('TEST_COVERAGE_AUDIT placeholders', () => {
    for (const phrase of sharedPhrases) {
      expect(planSkill).toContain(phrase);
      expect(shipSkill).toContain(phrase);
    }
    // Plan mode traces the plan, not a git diff
    expect(planSkill).toContain('Trace every codepath in the plan');
    expect(planSkill).not.toContain('git diff origin');
    // Ship mode traces the diff
    expect(shipSkill).toContain('Trace every codepath changed');
  });

  test('review mode uses Review Army for specialist dispatch', () => {
    expect(reviewSkill).toContain('Review Army');
    expect(reviewSkill).toContain('Specialist Dispatch');
    expect(reviewSkill).toContain('testing.md');
  });

  test('plan and ship modes include E2E decision matrix', () => {
    // Review mode delegates to Testing specialist
    for (const skill of [planSkill, shipSkill]) {
      expect(skill).toContain('E2E Test Decision Matrix');
      expect(skill).toContain('→E2E');
      expect(skill).toContain('→EVAL');
    }
  });

  test('plan and ship modes include regression rule', () => {
    // Review mode delegates to Testing specialist
    for (const skill of [planSkill, shipSkill]) {
      expect(skill).toContain('REGRESSION RULE');
      expect(skill).toContain('IRON RULE');
    }
  });

  test('plan and ship modes include test framework detection', () => {
    // Review mode delegates to Testing specialist
    for (const skill of [planSkill, shipSkill]) {
      expect(skill).toContain('Test Framework Detection');
      expect(skill).toContain('CLAUDE.md');
    }


@@ 664,11 672,12 @@ describe('TEST_COVERAGE_AUDIT placeholders', () => {
    expect(shipSkill).toContain('ship-test-plan');
  });

  test('review mode uses Fix-First + Review Army for specialist coverage', () => {
    expect(reviewSkill).toContain('Fix-First');
    expect(reviewSkill).toContain('INFORMATIONAL');
    // Review Army handles test coverage via Testing specialist subagent
    expect(reviewSkill).toContain('Review Army');
    expect(reviewSkill).toContain('Testing');
  });

  test('plan mode does NOT include ship-specific content', () => {


@@ 683,6 692,35 @@ describe('TEST_COVERAGE_AUDIT placeholders', () => {
    expect(reviewSkill).not.toContain('ship-test-plan');
  });

  test('review/specialists/ directory has all expected checklist files', () => {
    const specDir = path.join(ROOT, 'review', 'specialists');
    const expected = [
      'testing.md',
      'maintainability.md',
      'security.md',
      'performance.md',
      'data-migration.md',
      'api-contract.md',
      'red-team.md',
    ];
    for (const f of expected) {
      expect(fs.existsSync(path.join(specDir, f))).toBe(true);
    }
  });

  test('each specialist file has standard header with scope and output format', () => {
    const specDir = path.join(ROOT, 'review', 'specialists');
    const files = fs.readdirSync(specDir).filter(f => f.endsWith('.md'));
    for (const f of files) {
      const content = fs.readFileSync(path.join(specDir, f), 'utf-8');
      // All specialist files must have Scope and Output/JSON in header
      expect(content).toContain('Scope:');
      expect(content.toLowerCase()).toMatch(/output|json/);
      // Must define NO FINDINGS behavior
      expect(content).toContain('NO FINDINGS');
    }
  });

  // Regression guard: ship output contains key phrases from before the refactor
  test('ship SKILL.md regression guard — key phrases preserved', () => {
    const regressionPhrases = [


@@ 870,12 908,9 @@ describe('Coverage gate in ship', () => {
    expect(shipSkill).toContain('could not determine percentage — skipping');
  });

  test('review SKILL.md delegates coverage to Testing specialist', () => {
    // Coverage audit moved to Testing specialist subagent in Review Army
    expect(reviewSkill).toContain('testing.md');
    expect(reviewSkill).toContain('INFORMATIONAL');
  });
});


@@ 1604,10 1639,9 @@ describe('Codex generation (--host codex)', () => {
    const content = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
    // Correct: references to sidecar files use gstack/review/ path
    expect(content).toContain('.agents/skills/gstack/review/checklist.md');
    // design-checklist.md is now referenced via Review Army specialist (Claude only, stripped for Codex)
    // Wrong: must NOT reference gstack-review/checklist.md (file doesn't exist there)
    expect(content).not.toContain('.agents/skills/gstack-review/checklist.md');
    expect(content).not.toContain('.agents/skills/gstack-review/design-checklist.md');
  });

  test('sidecar paths in ship skill point to gstack/review/ for pre-landing review', () => {

M test/helpers/touchfiles.ts => test/helpers/touchfiles.ts +18 -0
@@ 59,6 59,15 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'review-base-branch':       ['review/**'],
  'review-design-lite':       ['review/**', 'test/fixtures/review-eval-design-slop.*'],

  // Review Army (specialist dispatch)
  'review-army-migration-safety': ['review/**', 'scripts/resolvers/review-army.ts', 'bin/gstack-diff-scope'],
  'review-army-perf-n-plus-one':  ['review/**', 'scripts/resolvers/review-army.ts', 'bin/gstack-diff-scope'],
  'review-army-delivery-audit':   ['review/**', 'scripts/resolvers/review.ts', 'scripts/resolvers/review-army.ts'],
  'review-army-quality-score':    ['review/**', 'scripts/resolvers/review-army.ts'],
  'review-army-json-findings':    ['review/**', 'scripts/resolvers/review-army.ts'],
  'review-army-red-team':         ['review/**', 'scripts/resolvers/review-army.ts'],
  'review-army-consensus':        ['review/**', 'scripts/resolvers/review-army.ts'],

  // Office Hours
  'office-hours-spec-review':  ['office-hours/**', 'scripts/gen-skill-docs.ts'],



@@ 204,6 213,15 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  'review-plan-completion': 'gate',
  'review-dashboard-via': 'gate',

  // Review Army — gate for core functionality, periodic for multi-specialist
  'review-army-migration-safety': 'gate',   // Specialist activation guardrail
  'review-army-perf-n-plus-one': 'gate',    // Specialist activation guardrail
  'review-army-delivery-audit': 'gate',     // Delivery integrity guardrail
  'review-army-quality-score': 'gate',      // Score computation
  'review-army-json-findings': 'gate',      // JSON schema compliance
  'review-army-red-team': 'periodic',       // Multi-agent coordination
  'review-army-consensus': 'periodic',      // Multi-specialist agreement

  // Office Hours
  'office-hours-spec-review': 'gate',


A test/skill-e2e-review-army.test.ts => test/skill-e2e-review-army.test.ts +562 -0
@@ 0,0 1,562 @@
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { runSkillTest } from './helpers/session-runner';
import {
  ROOT, runId, describeIfSelected, testConcurrentIfSelected,
  logCost, recordE2E, createEvalCollector, finalizeEvalCollector,
} from './helpers/e2e-helpers';
import { spawnSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

const evalCollector = createEvalCollector('e2e-review-army');

// Helper: create a git repo with a feature branch
function setupRepo(prefix: string): { dir: string; run: (cmd: string, args: string[]) => void } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), `skill-e2e-${prefix}-`));
  const run = (cmd: string, args: string[]) =>
    spawnSync(cmd, args, { cwd: dir, stdio: 'pipe', timeout: 5000 });
  run('git', ['init', '-b', 'main']);
  run('git', ['config', 'user.email', 'test@test.com']);
  run('git', ['config', 'user.name', 'Test']);
  return { dir, run };
}

// Helper: copy review skill files to test dir
function copyReviewFiles(dir: string) {
  fs.copyFileSync(path.join(ROOT, 'review', 'SKILL.md'), path.join(dir, 'review-SKILL.md'));
  fs.copyFileSync(path.join(ROOT, 'review', 'checklist.md'), path.join(dir, 'review-checklist.md'));
  fs.copyFileSync(path.join(ROOT, 'review', 'greptile-triage.md'), path.join(dir, 'review-greptile-triage.md'));
  // Copy specialist checklists
  const specDir = path.join(dir, 'review-specialists');
  fs.mkdirSync(specDir, { recursive: true });
  const specialistsRoot = path.join(ROOT, 'review', 'specialists');
  for (const f of fs.readdirSync(specialistsRoot)) {
    fs.copyFileSync(path.join(specialistsRoot, f), path.join(specDir, f));
  }
}

// --- Review Army: Migration Safety ---

describeIfSelected('Review Army: Migration Safety', ['review-army-migration-safety'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-migration');
    dir = repo.dir;

    // Base commit
    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    // Feature branch with unsafe migration
    repo.run('git', ['checkout', '-b', 'feature/drop-columns']);
    fs.mkdirSync(path.join(dir, 'db', 'migrate'), { recursive: true });
    const migrationContent = fs.readFileSync(
      path.join(ROOT, 'test', 'fixtures', 'review-army-migration.sql'), 'utf-8'
    );
    fs.writeFileSync(path.join(dir, 'db', 'migrate', '20260330_drop_columns.sql'), migrationContent);
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'drop email and phone columns']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-migration-safety', async () => {
    const result = await runSkillTest({
      prompt: `You are in a git repo on a feature branch with a database migration that drops columns.
Read review-SKILL.md for instructions. Also read review-checklist.md.
The specialist checklists are in review-specialists/ (testing.md, security.md, performance.md, data-migration.md, etc.).

Skip the preamble, lake intro, telemetry sections.
Run Step 4 (Critical pass) then Step 4.5 (Review Army — Specialist Dispatch).
The base branch is main. Run gstack-diff-scope style analysis on the changed files.
Since db/migrate/ files changed, the Data Migration specialist should activate.

For the specialist dispatch, instead of launching subagents, just read review-specialists/data-migration.md
and apply it yourself against the diff (git diff main...HEAD).

Write your findings to ${dir}/review-output.md`,
      workingDirectory: dir,
      maxTurns: 20,
      timeout: 180_000,
      testName: 'review-army-migration-safety',
      runId,
    });

    logCost('/review army migration', result);
    recordE2E(evalCollector, '/review army migration safety', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    // Verify migration issues were caught
    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8').toLowerCase();
      const hasMigrationFinding =
        content.includes('drop') ||
        content.includes('data loss') ||
        content.includes('reversib') ||
        content.includes('migration') ||
        content.includes('column');
      expect(hasMigrationFinding).toBe(true);
    }
  }, 210_000);
});

// --- Review Army: N+1 Performance ---

describeIfSelected('Review Army: N+1 Performance', ['review-army-perf-n-plus-one'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-n-plus-one');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/add-posts-index']);
    const n1Content = fs.readFileSync(
      path.join(ROOT, 'test', 'fixtures', 'review-army-n-plus-one.rb'), 'utf-8'
    );
    fs.writeFileSync(path.join(dir, 'posts_controller.rb'), n1Content);
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'add posts controller']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-perf-n-plus-one', async () => {
    const result = await runSkillTest({
      prompt: `You are in a git repo on a feature branch with a Ruby controller that has N+1 queries.
Read review-SKILL.md for instructions. Also read review-checklist.md.
The specialist checklists are in review-specialists/ (testing.md, performance.md, etc.).

Skip the preamble, lake intro, and telemetry sections.
Run Step 4 (Critical pass) then Step 4.5 (Review Army).
The base branch is main. This is a Ruby backend file, so the Performance specialist should activate.

For the specialist dispatch, read review-specialists/performance.md and apply it against the diff.

Write your findings to ${dir}/review-output.md`,
      workingDirectory: dir,
      maxTurns: 20,
      timeout: 180_000,
      testName: 'review-army-perf-n-plus-one',
      runId,
    });

    logCost('/review army n+1', result);
    recordE2E(evalCollector, '/review army N+1 detection', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8').toLowerCase();
      const hasN1Finding =
        content.includes('n+1') ||
        content.includes('n + 1') ||
        content.includes('eager') ||
        content.includes('includes') ||
        content.includes('preload') ||
        content.includes('query') ||
        content.includes('loop');
      expect(hasN1Finding).toBe(true);
    }
  }, 210_000);
});

// --- Review Army: Delivery Audit ---

describeIfSelected('Review Army: Delivery Audit', ['review-army-delivery-audit'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-delivery');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/three-features']);

    // Write a plan file promising 3 features
    fs.writeFileSync(path.join(dir, 'PLAN.md'), `# Feature Plan

## Implementation Items
1. Add user authentication with login/logout
2. Add user profile page with avatar upload
3. Add email notification system for new signups

## Test Items
- Test login flow
- Test profile page rendering
- Test email sending
`);
    repo.run('git', ['add', 'PLAN.md']);
    repo.run('git', ['commit', '-m', 'add plan']);

    // Implement only 2 of 3 features
    fs.writeFileSync(path.join(dir, 'auth.rb'), `class AuthController
  def login
    # authenticate user
    session[:user_id] = user.id
  end

  def logout
    session.delete(:user_id)
  end
end
`);
    fs.writeFileSync(path.join(dir, 'profile.rb'), `class ProfileController
  def show
    @user = User.find(params[:id])
  end

  def update_avatar
    @user.avatar.attach(params[:avatar])
  end
end
`);
    // NOTE: email notification system is NOT implemented (intentionally missing)
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'implement auth and profile features']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-delivery-audit', async () => {
    const result = await runSkillTest({
      prompt: `You are in a git repo on branch feature/three-features.
There is a PLAN.md file that promises 3 features: auth, profile, and email notifications.
The diff (git diff main...HEAD) only implements 2 of them (auth and profile).

Read review-SKILL.md for the review workflow. Focus on the Plan Completion Audit section.
The plan file is at ./PLAN.md. Cross-reference it against the diff.

For each plan item, classify as DONE, PARTIAL, NOT DONE, or CHANGED.
The email notification system should be classified as NOT DONE.

Write your completion audit to ${dir}/review-output.md`,
      workingDirectory: dir,
      maxTurns: 15,
      timeout: 120_000,
      testName: 'review-army-delivery-audit',
      runId,
    });

    logCost('/review army delivery', result);
    recordE2E(evalCollector, '/review army delivery audit', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8').toLowerCase();
      // Should identify email notifications as NOT DONE
      const hasNotDone =
        content.includes('not done') ||
        content.includes('not_done') ||
        content.includes('missing') ||
        content.includes('not implemented');
      const mentionsEmail =
        content.includes('email') ||
        content.includes('notification');
      expect(hasNotDone).toBe(true);
      expect(mentionsEmail).toBe(true);
    }
  }, 150_000);
});

// --- Review Army: Quality Score ---

describeIfSelected('Review Army: Quality Score', ['review-army-quality-score'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-quality');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/add-controller']);
    // Code with obvious issues for quality score computation
    fs.writeFileSync(path.join(dir, 'user_controller.rb'), `class UserController
  def create
    # SQL injection
    User.where("name = '#{params[:name]}'")
    # Magic number
    if users.count > 42
      raise "too many"
    end
  end
end
`);
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'add user controller']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-quality-score', async () => {
    const result = await runSkillTest({
      prompt: `You are in a git repo with a vulnerable user controller.
Read review-SKILL.md and review-checklist.md.
Skip the preamble, lake intro, and telemetry sections.

Run the Critical pass (Step 4) against the diff (git diff main...HEAD).
Then compute the PR Quality Score as described in the Review Army merge step:
quality_score = max(0, 10 - (critical_count * 2 + informational_count * 0.5))

Write your findings AND the computed quality score to ${dir}/review-output.md
Include the line: "PR Quality Score: X/10" where X is the computed score.`,
      workingDirectory: dir,
      maxTurns: 15,
      timeout: 120_000,
      testName: 'review-army-quality-score',
      runId,
    });

    logCost('/review army quality', result);
    recordE2E(evalCollector, '/review army quality score', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8');
      // Should contain a quality score
      const hasScore =
        content.toLowerCase().includes('quality score') ||
        content.match(/\d+\/10/);
      expect(hasScore).toBeTruthy();
    }
  }, 150_000);
});
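
The quality-score formula quoted in the prompt above can be sketched as a standalone helper, which makes the expected arithmetic easy to eyeball when reading test output. This is an illustrative sketch only, not wired into the test harness; `computeQualityScore` is a hypothetical name, and the weights (2 per critical finding, 0.5 per informational) are taken directly from the prompt string.

```typescript
// Illustrative sketch of the PR quality-score formula from the prompt above.
// Not used by the tests; computeQualityScore is a hypothetical helper name.
// quality_score = max(0, 10 - (critical_count * 2 + informational_count * 0.5))
function computeQualityScore(criticalCount: number, informationalCount: number): number {
  return Math.max(0, 10 - (criticalCount * 2 + informationalCount * 0.5));
}

// Example: 2 critical + 4 informational findings → 10 - (4 + 2) = 4
console.log(computeQualityScore(2, 4)); // 4
```

Note the floor at 0: a sufficiently noisy PR (e.g. 6 critical findings) bottoms out rather than going negative, which matches the `max(0, ...)` in the prompt.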

// --- Review Army: JSON Findings ---

describeIfSelected('Review Army: JSON Findings', ['review-army-json-findings'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-json');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/vuln']);
    fs.writeFileSync(path.join(dir, 'search.rb'), `class SearchController
  def index
    # SQL injection via string interpolation
    results = ActiveRecord::Base.connection.execute(
      "SELECT * FROM products WHERE name LIKE '%#{params[:q]}%'"
    )
    render json: results
  end
end
`);
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'add search']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-json-findings', async () => {
    const result = await runSkillTest({
      prompt: `You are reviewing a git diff with a SQL injection vulnerability.
Read review-specialists/security.md for the security checklist.

Apply the checklist against this diff (git diff main...HEAD).
Output your findings as JSON objects, one per line, following the schema:
{"severity":"CRITICAL","confidence":9,"path":"search.rb","line":4,"category":"injection","summary":"SQL injection via string interpolation","fix":"Use parameterized query","fingerprint":"search.rb:4:injection","specialist":"security"}

Write ONLY JSON findings (no preamble) to ${dir}/findings.json`,
      workingDirectory: dir,
      maxTurns: 12,
      timeout: 90_000,
      testName: 'review-army-json-findings',
      runId,
    });

    logCost('/review army json', result);
    recordE2E(evalCollector, '/review army JSON findings', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const findingsPath = path.join(dir, 'findings.json');
    if (fs.existsSync(findingsPath)) {
      const content = fs.readFileSync(findingsPath, 'utf-8').trim();
      const lines = content.split('\n').filter(l => l.trim());
      // At least one finding
      expect(lines.length).toBeGreaterThanOrEqual(1);
      // At least one line must parse as JSON and carry the required schema fields;
      // without the validated flag, a file of unparseable lines would silently pass.
      let validated = false;
      for (const line of lines) {
        let parsed: any;
        try { parsed = JSON.parse(line); } catch { continue; }
        // Required fields per schema
        expect(parsed).toHaveProperty('severity');
        expect(parsed).toHaveProperty('confidence');
        expect(parsed).toHaveProperty('path');
        expect(parsed).toHaveProperty('category');
        expect(parsed).toHaveProperty('summary');
        expect(parsed).toHaveProperty('specialist');
        validated = true;
        break; // One valid finding is enough for this gate test
      }
      expect(validated).toBe(true);
    }
  }, 120_000);
});

// --- Review Army: Red Team (periodic) ---

describeIfSelected('Review Army: Red Team', ['review-army-red-team'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-redteam');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/large-change']);
    // Create a large diff (300+ lines)
    const lines: string[] = ['class LargeController'];
    for (let i = 0; i < 100; i++) {
      lines.push(`  def method_${i}`);
      lines.push(`    data = params[:input_${i}]`);
      lines.push(`    process(data)`);
      lines.push('  end');
      lines.push('');
    }
    lines.push('end');
    fs.writeFileSync(path.join(dir, 'large_controller.rb'), lines.join('\n'));
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'add large controller']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-red-team', async () => {
    const result = await runSkillTest({
      prompt: `You are reviewing a large diff (300+ lines). Read review-SKILL.md.
Skip the preamble, lake intro, and telemetry sections.

The diff is large enough to activate the Red Team specialist.
Read review-specialists/red-team.md and apply it against the diff (git diff main...HEAD).
Focus on finding issues that other specialists might miss.

Write your red team findings to ${dir}/review-output.md
Start the file with "RED TEAM REVIEW" on the first line.`,
      workingDirectory: dir,
      maxTurns: 20,
      timeout: 180_000,
      testName: 'review-army-red-team',
      runId,
    });

    logCost('/review army red-team', result);
    recordE2E(evalCollector, '/review army red team', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8');
      expect(content.toLowerCase()).toMatch(/red team|adversarial/);
    }
  }, 210_000);
});

// --- Review Army: Consensus (periodic) ---

describeIfSelected('Review Army: Consensus', ['review-army-consensus'], () => {
  let dir: string;

  beforeAll(() => {
    const repo = setupRepo('army-consensus');
    dir = repo.dir;

    fs.writeFileSync(path.join(dir, 'app.rb'), '# base\n');
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'initial']);

    repo.run('git', ['checkout', '-b', 'feature/vuln-auth']);
    // SQL injection that both security AND testing specialists should flag
    fs.writeFileSync(path.join(dir, 'auth_controller.rb'), `class AuthController
  def login
    user = User.find_by("email = '#{params[:email]}' AND password = '#{params[:password]}'")
    if user
      session[:user_id] = user.id
      redirect_to root_path
    else
      flash[:error] = "Invalid credentials"
      render :login
    end
  end
end
`);
    repo.run('git', ['add', '.']);
    repo.run('git', ['commit', '-m', 'add auth controller']);

    copyReviewFiles(dir);
  });

  afterAll(() => { try { fs.rmSync(dir, { recursive: true, force: true }); } catch {} });

  testConcurrentIfSelected('review-army-consensus', async () => {
    const result = await runSkillTest({
      prompt: `You are reviewing a git diff with a SQL injection in an auth controller.
Read review-SKILL.md, review-checklist.md, and the specialist checklists in review-specialists/.

This vulnerability should be caught by BOTH the security specialist (injection vector)
AND the testing specialist (no test for auth bypass).

Run the review. In your output, if a finding is flagged by multiple perspectives,
mark it as "MULTI-SPECIALIST CONFIRMED" with the confirming categories.

Write findings to ${dir}/review-output.md`,
      workingDirectory: dir,
      maxTurns: 20,
      timeout: 180_000,
      testName: 'review-army-consensus',
      runId,
    });

    logCost('/review army consensus', result);
    recordE2E(evalCollector, '/review army consensus', 'Review Army', result);
    expect(result.exitReason).toBe('success');

    const outputPath = path.join(dir, 'review-output.md');
    if (fs.existsSync(outputPath)) {
      const content = fs.readFileSync(outputPath, 'utf-8').toLowerCase();
      // Should catch the SQL injection
      const hasSqlFinding =
        content.includes('sql') ||
        content.includes('injection') ||
        content.includes('interpolat');
      expect(hasSqlFinding).toBe(true);
    }
  }, 210_000);
});

// Finalize eval collector
afterAll(async () => {
  await finalizeEvalCollector(evalCollector);
});