From a2d756f945854a8ad8b61f8c84021f0f221d223c Mon Sep 17 00:00:00 2001 From: Garry Tan Date: Tue, 17 Mar 2026 13:05:18 -0500 Subject: [PATCH] feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136) * feat: test bootstrap, regression tests, coverage audit, retro test health - Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts - Add Phase 8e.5 regression test generation to /qa and /qa-design-review - Add Step 3.4 test coverage audit with quality scoring to /ship - Add test health tracking to /retro - Add 2 E2E evals (bootstrap + coverage audit) - Add 26 validation tests - Update ARCHITECTURE.md placeholder table - Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests) * chore: bump version and changelog (v0.6.0) Co-Authored-By: Claude Opus 4.6 * feat: make coverage audit trace actual codepaths, not just syntax patterns Step 3.4 now instructs Claude to read full files, trace data flow through every branch, diagram the execution, and check each branch against tests. Phase 8e.5 regression tests now trace the bug's codepath before writing the test, catching adjacent edge cases. * feat: coverage audit now maps user flows, interactions, and error states Step 3.4 now covers the full picture: code branches AND user-facing behavior. Maps user flows (complete journey through the feature), interaction edge cases (double-click, back button, stale state, slow connection), error states (what does the user actually see?), and boundary states (zero results, 10k results, max-length input). Coverage diagram splits into Code Path Coverage and User Flow Coverage sections with separate percentages. 
* fix: raise test gen cap to 20, add validation tests for user flow coverage - Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined) - Add 3 validation tests: codepath tracing, user flow mapping, diagram sections --------- Co-authored-by: Claude Opus 4.6 --- ARCHITECTURE.md | 2 + CHANGELOG.md | 21 ++ TODOS.md | 24 +++ VERSION | 2 +- qa-design-review/SKILL.md | 170 +++++++++++++++- qa-design-review/SKILL.md.tmpl | 19 +- qa-only/SKILL.md | 1 + qa-only/SKILL.md.tmpl | 1 + qa/SKILL.md | 211 +++++++++++++++++++- qa/SKILL.md.tmpl | 60 +++++- qa/templates/qa-report-template.md | 16 ++ retro/SKILL.md | 29 ++- retro/SKILL.md.tmpl | 29 ++- scripts/gen-skill-docs.ts | 156 +++++++++++++++ ship/SKILL.md | 302 +++++++++++++++++++++++++++++ ship/SKILL.md.tmpl | 151 +++++++++++++++ test/skill-e2e.test.ts | 263 +++++++++++++++++++++++++ test/skill-validation.test.ts | 222 +++++++++++++++++++++ 18 files changed, 1672 insertions(+), 7 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index bccb13ffa2513cc8ef919069b73ac6b40317bd4e..79bfda7516f9dcdc45ff83c6efee04a87621fc09 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -203,6 +203,8 @@ Templates contain the workflows, tips, and examples that require human judgment. | `{{BASE_BRANCH_DETECT}}` | `gen-skill-docs.ts` | Dynamic base branch detection for PR-targeting skills (ship, review, qa, plan-ceo-review) | | `{{QA_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared QA methodology block for /qa and /qa-only | | `{{DESIGN_METHODOLOGY}}` | `gen-skill-docs.ts` | Shared design audit methodology for /plan-design-review and /qa-design-review | +| `{{REVIEW_DASHBOARD}}` | `gen-skill-docs.ts` | Review Readiness Dashboard for /ship pre-flight | +| `{{TEST_BOOTSTRAP}}` | `gen-skill-docs.ts` | Test framework detection, bootstrap, CI/CD setup for /qa, /ship, /qa-design-review | This is structurally sound — if a command exists in code, it appears in docs. If it doesn't exist, it can't appear. 
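The placeholder table above implies a simple mechanism: each `{{NAME}}` token in a `.tmpl` file is looked up in a resolver registry and replaced with generated text. The following is a minimal sketch of that idea, not the actual `gen-skill-docs.ts` code. The `Resolver` type, the `renderTemplate` helper, and the resolver bodies are hypothetical names and stand-ins chosen for illustration; only the placeholder names and the existence of a `RESOLVERS` map come from this patch.

```typescript
// Hypothetical sketch: NOT the real gen-skill-docs.ts implementation.
// Assumption: a resolver is a zero-argument function returning markdown text.
type Resolver = () => string;

// Assumed shape of the registry. The placeholder names come from the
// ARCHITECTURE.md table; the bodies here are stand-ins.
const RESOLVERS: Record<string, Resolver> = {
  TEST_BOOTSTRAP: () => "## Test Framework Bootstrap\n(stand-in body)",
  REVIEW_DASHBOARD: () => "## Review Readiness Dashboard\n(stand-in body)",
};

// Replace every {{NAME}} token with its resolver output. Unknown names throw,
// so a template cannot reference a placeholder that has no resolver in code.
function renderTemplate(template: string): string {
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, name: string) => {
    const resolver = RESOLVERS[name];
    if (resolver === undefined) {
      throw new Error(`Unknown placeholder: {{${name}}}`);
    }
    return resolver();
  });
}
```

Under these assumptions, `renderTemplate("{{TEST_BOOTSTRAP}}")` would yield the stand-in bootstrap section, while a misspelled token fails at generation time rather than leaking into the generated `SKILL.md`. Throwing on an unknown name is one way to get the "if it doesn't exist, it can't appear" property; the real script may enforce it differently.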
diff --git a/CHANGELOG.md b/CHANGELOG.md index 4a98b63558436614760eb01bc075025663b6fe25..12fa243b8bbd943a4a07f0a49004c7e936796394 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,26 @@ # Changelog +## 0.6.0 — 2026-03-17 + +- **100% test coverage is the key to great vibe coding.** gstack now bootstraps test frameworks from scratch when your project doesn't have one. Detects your runtime, researches the best framework, asks you to pick, installs it, writes 3-5 real tests for your actual code, sets up CI/CD (GitHub Actions), creates TESTING.md, and adds test culture instructions to CLAUDE.md. Every Claude Code session after that writes tests naturally. +- **Every bug fix now gets a regression test.** When `/qa` fixes a bug and verifies it, Phase 8e.5 automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report. Auto-incrementing filenames prevent collisions across sessions. +- **Ship with confidence — coverage audit shows what's tested and what's not.** `/ship` Step 3.4 builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars (★★★ = edge cases + errors, ★★ = happy path, ★ = smoke test). Gaps get tests auto-generated. PR body shows "Tests: 42 → 47 (+5 new)". +- **Your retro tracks test health.** `/retro` now shows total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area. +- **Design reviews generate regression tests too.** `/qa-design-review` Phase 8e.5 skips CSS-only fixes (those are caught by re-running the design audit) but writes tests for JavaScript behavior changes like broken dropdowns or animation failures. + +### For contributors + +- Added `generateTestBootstrap()` resolver to `gen-skill-docs.ts` (~155 lines). Registered as `{{TEST_BOOTSTRAP}}` in the RESOLVERS map. 
Inserted into qa, ship (Step 2.5), and qa-design-review templates. +- Phase 8e.5 regression test generation added to `qa/SKILL.md.tmpl` (46 lines) and CSS-aware variant to `qa-design-review/SKILL.md.tmpl` (12 lines). Rule 13 amended to allow creating new test files. +- Step 3.4 test coverage audit added to `ship/SKILL.md.tmpl` (88 lines) with quality scoring rubric and ASCII diagram format. +- Test health tracking added to `retro/SKILL.md.tmpl`: 3 new data gathering commands, metrics row, narrative section, JSON schema field. +- `qa-only/SKILL.md.tmpl` gets recommendation note when no test framework detected. +- `qa-report-template.md` gains Regression Tests section with deferred test specs. +- ARCHITECTURE.md placeholder table updated with `{{TEST_BOOTSTRAP}}` and `{{REVIEW_DASHBOARD}}`. +- WebSearch added to allowed-tools for qa, ship, qa-design-review. +- 26 new validation tests, 2 new E2E evals (bootstrap + coverage audit). +- 2 new P3 TODOs: CI/CD for non-GitHub providers, auto-upgrade weak tests. + ## 0.5.4 — 2026-03-17 - **Engineering review is always the full review now.** `/plan-eng-review` no longer asks you to choose between "big change" and "small change" modes. Every plan gets the full interactive walkthrough (architecture, code quality, tests, performance). Scope reduction is only suggested when the complexity check actually triggers — not as a standing menu option. diff --git a/TODOS.md b/TODOS.md index f52bb69350e4bcadd1d8c5374086b6686379a6fc..a0801d854529e5030015916a792f6971dbffe7a2 100644 --- a/TODOS.md +++ b/TODOS.md @@ -263,6 +263,30 @@ **Effort:** S **Priority:** P3 +### CI/CD generation for non-GitHub providers + +**What:** Extend CI/CD bootstrap to generate GitLab CI (`.gitlab-ci.yml`), CircleCI (`.circleci/config.yml`), and Bitrise pipelines. + +**Why:** Not all projects use GitHub Actions. Universal CI/CD bootstrap would make test bootstrap work for everyone. + +**Context:** v1 ships with GitHub Actions only. 
Detection logic already checks for `.gitlab-ci.yml`, `.circleci/`, `bitrise.yml` and skips with an informational note. Each provider needs ~20 lines of template text in `generateTestBootstrap()`. + +**Effort:** M +**Priority:** P3 +**Depends on:** Test bootstrap (shipped) + +### Auto-upgrade weak tests (★) to strong tests (★★★) + +**What:** When Step 3.4 coverage audit identifies existing ★-rated tests (smoke/trivial assertions), generate improved versions testing edge cases and error paths. + +**Why:** Many codebases have tests that technically exist but don't catch real bugs — `expect(component).toBeDefined()` isn't testing behavior. Upgrading these closes the gap between "has tests" and "has good tests." + +**Context:** Requires the quality scoring rubric from the test coverage audit. Modifying existing test files is riskier than creating new ones — needs careful diffing to ensure the upgraded test still passes. Consider creating a companion test file rather than modifying the original. + +**Effort:** M +**Priority:** P3 +**Depends on:** Test quality scoring (shipped) + ## Retro ### Deployment health tracking (retro + browse) diff --git a/VERSION b/VERSION index 7d8568351b4f8d3809763af69f84d5499ef881d0..a918a2aa18d5bec6a8bb93891a7a63c243111796 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.5.4 +0.6.0 diff --git a/qa-design-review/SKILL.md b/qa-design-review/SKILL.md index 0d8d0771d21341816d8daf9cdde512590bac1f10..7044c560abb3acc2beacbfdc6aa5529220a51cd5 100644 --- a/qa-design-review/SKILL.md +++ b/qa-design-review/SKILL.md @@ -14,6 +14,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- @@ -136,6 +137,161 @@ If `NEEDS_SETUP`: 2. Run: `cd && ./setup` 3. 
If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` +**Check test framework (bootstrap if needed):** + +## Test Framework Bootstrap + +**Detect existing test framework and project runtime:** + +```bash +# Detect project runtime +[ -f Gemfile ] && echo "RUNTIME:ruby" +[ -f package.json ] && echo "RUNTIME:node" +[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python" +[ -f go.mod ] && echo "RUNTIME:go" +[ -f Cargo.toml ] && echo "RUNTIME:rust" +[ -f composer.json ] && echo "RUNTIME:php" +[ -f mix.exs ] && echo "RUNTIME:elixir" +# Detect sub-frameworks +[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails" +[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs" +# Check for existing test infrastructure +ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null +ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null +# Check opt-out marker +[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" +``` + +**If test framework detected** (config files or test directories found): +Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." +Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). +Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.** + +**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.** + +**If NO runtime detected** (no config files found): Use AskUserQuestion: +"I couldn't detect your project's language. What runtime are you using?" +Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests. +If user picks H → write `.gstack/no-test-bootstrap` and continue without tests. + +**If runtime detected but no test framework — bootstrap:** + +### B2. 
Research best practices + +Use WebSearch to find current best practices for the detected runtime: +- `"[runtime] best test framework 2025 2026"` +- `"[framework A] vs [framework B] comparison"` + +If WebSearch is unavailable, use this built-in knowledge table: + +| Runtime | Primary recommendation | Alternative | +|---------|----------------------|-------------| +| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers | +| Node.js | vitest + @testing-library | jest + @testing-library | +| Next.js | vitest + @testing-library/react + playwright | jest + cypress | +| Python | pytest + pytest-cov | unittest | +| Go | stdlib testing + testify | stdlib only | +| Rust | cargo test (built-in) + mockall | — | +| PHP | phpunit + mockery | pest | +| Elixir | ExUnit (built-in) + ex_machina | — | + +### B3. Framework selection + +Use AskUserQuestion: +"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options: +A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e +B) [Alternative] — [rationale]. Includes: [packages] +C) Skip — don't set up testing right now +RECOMMENDATION: Choose A because [reason based on project context]" + +If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests. + +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). 
Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. 
With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. + +Append a `## Testing` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +```bash +git status --porcelain +``` + +Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): +`git commit -m "chore: bootstrap test framework ({framework name})"` + +--- + **Create output directories:** ```bash @@ -565,6 +721,18 @@ Take **before/after screenshot pair** for every fix. - **best-effort**: fix applied but couldn't fully verify (e.g., needs specific browser state) - **reverted**: regression detected → `git revert HEAD` → mark finding as "deferred" +### 8e.5. Regression Test (design-review variant) + +Design fixes are typically CSS-only. Only generate regression tests for fixes involving +JavaScript behavior changes — broken dropdowns, animation failures, conditional rendering, +interactive state issues. + +For CSS-only fixes: skip entirely. CSS regressions are caught by re-running /qa-design-review. 
+ +If the fix involved JS behavior: follow the same procedure as /qa Phase 8e.5 (study existing +test patterns, write a regression test encoding the exact bug condition, run it, commit if +passes or defer if fails). Commit format: `test(design): regression test for FINDING-NNN`. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the design-fix risk level: @@ -639,7 +807,7 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple design fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code and styles. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the design-fix risk heuristic. When in doubt, stop and ask. 16. **CSS-first.** Prefer CSS/styling changes over structural component changes. CSS-only changes are safer and more reversible. diff --git a/qa-design-review/SKILL.md.tmpl b/qa-design-review/SKILL.md.tmpl index 0053a494e29604b55a8fb1a4484edef6fc22ac47..5969fb52e802872427d4afd8ab403b2af41e5a55 100644 --- a/qa-design-review/SKILL.md.tmpl +++ b/qa-design-review/SKILL.md.tmpl @@ -14,6 +14,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -54,6 +55,10 @@ fi {{BROWSE_SETUP}} +**Check test framework (bootstrap if needed):** + +{{TEST_BOOTSTRAP}} + **Create output directories:** ```bash @@ -153,6 +158,18 @@ Take **before/after screenshot pair** for every fix. - **best-effort**: fix applied but couldn't fully verify (e.g., needs specific browser state) - **reverted**: regression detected → `git revert HEAD` → mark finding as "deferred" +### 8e.5. 
Regression Test (design-review variant) + +Design fixes are typically CSS-only. Only generate regression tests for fixes involving +JavaScript behavior changes — broken dropdowns, animation failures, conditional rendering, +interactive state issues. + +For CSS-only fixes: skip entirely. CSS regressions are caught by re-running /qa-design-review. + +If the fix involved JS behavior: follow the same procedure as /qa Phase 8e.5 (study existing +test patterns, write a regression test encoding the exact bug condition, run it, commit if +passes or defer if fails). Commit format: `test(design): regression test for FINDING-NNN`. + ### 8f. Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the design-fix risk level: @@ -227,7 +244,7 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple design fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code and styles. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the design-fix risk heuristic. When in doubt, stop and ask. 16. **CSS-first.** Prefer CSS/styling changes over structural component changes. CSS-only changes are safer and more reversible. diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index 36f5fead92fb2c18262a65134ee1bb78c8554bd5..4fa0cf0445a9a62a17a65a00d4badc969378be3b 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -452,3 +452,4 @@ Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md` ## Additional Rules (qa-only specific) 11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. 
Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop. +12. **No test framework detected?** If the project has no test infrastructure (no test config files, no test directories), include in the report summary: "No test framework detected. Run `/qa` to bootstrap one and enable regression test generation." diff --git a/qa-only/SKILL.md.tmpl b/qa-only/SKILL.md.tmpl index 101cd71ca630ec7787df110331111a3d417bdd7b..831e71ed52ff3fd82cf151131745955d9583e68d 100644 --- a/qa-only/SKILL.md.tmpl +++ b/qa-only/SKILL.md.tmpl @@ -97,3 +97,4 @@ Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md` ## Additional Rules (qa-only specific) 11. **Never fix bugs.** Find and document only. Do not read source code, edit files, or suggest fixes in the report. Your job is to report what's broken, not to fix it. Use `/qa` for the test-fix-verify loop. +12. **No test framework detected?** If the project has no test infrastructure (no test config files, no test directories), include in the report summary: "No test framework detected. Run `/qa` to bootstrap one and enable regression test generation." diff --git a/qa/SKILL.md b/qa/SKILL.md index 9bd8fc9b9febac3975641bcd8915879000817c55..c01514cf3511aa233e8da940a63c577cda07024a 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -16,6 +16,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- @@ -157,6 +158,161 @@ If `NEEDS_SETUP`: 2. Run: `cd && ./setup` 3. 
If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` +**Check test framework (bootstrap if needed):** + +## Test Framework Bootstrap + +**Detect existing test framework and project runtime:** + +```bash +# Detect project runtime +[ -f Gemfile ] && echo "RUNTIME:ruby" +[ -f package.json ] && echo "RUNTIME:node" +[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python" +[ -f go.mod ] && echo "RUNTIME:go" +[ -f Cargo.toml ] && echo "RUNTIME:rust" +[ -f composer.json ] && echo "RUNTIME:php" +[ -f mix.exs ] && echo "RUNTIME:elixir" +# Detect sub-frameworks +[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails" +[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs" +# Check for existing test infrastructure +ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null +ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null +# Check opt-out marker +[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" +``` + +**If test framework detected** (config files or test directories found): +Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." +Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). +Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.** + +**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.** + +**If NO runtime detected** (no config files found): Use AskUserQuestion: +"I couldn't detect your project's language. What runtime are you using?" +Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests. +If user picks H → write `.gstack/no-test-bootstrap` and continue without tests. + +**If runtime detected but no test framework — bootstrap:** + +### B2. 
Research best practices + +Use WebSearch to find current best practices for the detected runtime: +- `"[runtime] best test framework 2025 2026"` +- `"[framework A] vs [framework B] comparison"` + +If WebSearch is unavailable, use this built-in knowledge table: + +| Runtime | Primary recommendation | Alternative | +|---------|----------------------|-------------| +| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers | +| Node.js | vitest + @testing-library | jest + @testing-library | +| Next.js | vitest + @testing-library/react + playwright | jest + cypress | +| Python | pytest + pytest-cov | unittest | +| Go | stdlib testing + testify | stdlib only | +| Rust | cargo test (built-in) + mockall | — | +| PHP | phpunit + mockery | pest | +| Elixir | ExUnit (built-in) + ex_machina | — | + +### B3. Framework selection + +Use AskUserQuestion: +"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options: +A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e +B) [Alternative] — [rationale]. Includes: [packages] +C) Skip — don't set up testing right now +RECOMMENDATION: Choose A because [reason based on project context]" + +If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests. + +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime). 
Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. 
With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. + +Append a `## Testing` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +```bash +git status --porcelain +``` + +Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): +`git commit -m "chore: bootstrap test framework ({framework name})"` + +--- + **Create output directories:** ```bash @@ -541,6 +697,59 @@ $B snapshot -D - **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service) - **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred" +### 8e.5. Regression Test + +Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap. + +**1. Study the project's existing test patterns:** + +Read 2-3 test files closest to the fix (same directory, same code type). Match exactly: +- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns +The regression test must look like it was written by the same developer. + +**2. 
Trace the bug's codepath, then write a regression test:** + +Before writing the test, trace the data flow through the code you just fixed: +- What input/state triggered the bug? (the exact precondition) +- What codepath did it follow? (which branches, which function calls) +- Where did it break? (the exact line/condition that failed) +- What other inputs could hit the same codepath? (edge cases around the fix) + +The test MUST: +- Set up the precondition that triggered the bug (the exact state that made it break) +- Perform the action that exposed the bug +- Assert the correct behavior (NOT "it renders" or "it doesn't throw") +- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value) +- Include full attribution comment: + ``` + // Regression: ISSUE-NNN — {what broke} + // Found by /qa on {YYYY-MM-DD} + // Report: .gstack/qa-reports/qa-report-{domain}-{date}.md + ``` + +Test type decision: +- Console error / JS exception / logic bug → unit or integration test +- Broken form / API failure / data flow bug → integration test with request/response +- Visual bug with JS behavior (broken dropdown, animation) → component test +- Pure CSS → skip (caught by QA reruns) + +Generate unit tests. Mock all external dependencies (DB, API, Redis, file system). + +Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1. + +**3. Run only the new test file:** + +```bash +{detected test command} {new-test-file} +``` + +**4. Evaluate:** +- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"` +- Fails → fix test once. Still failing → delete test, defer. +- Taking >2 min exploration → skip and defer. + +**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic. + ### 8f. 
Self-Regulation (STOP AND EVALUATE) Every 5 fixes (or after any revert), compute the WTF-likelihood: @@ -614,6 +823,6 @@ If the repo has a `TODOS.md`: 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty. 12. **One commit per fix.** Never bundle multiple fixes into one commit. -13. **Never modify tests or CI configuration.** Only fix application source code. +13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files. 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately. 15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask. diff --git a/qa/SKILL.md.tmpl b/qa/SKILL.md.tmpl index 45dfbea6a206d79bc81e99cf9e7b6853b05a5b2a..bd94debe73a74495288fcac3851f027b2e624f88 100644 --- a/qa/SKILL.md.tmpl +++ b/qa/SKILL.md.tmpl @@ -16,6 +16,7 @@ allowed-tools: - Glob - Grep - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -58,6 +59,10 @@ fi {{BROWSE_SETUP}} +**Check test framework (bootstrap if needed):** + +{{TEST_BOOTSTRAP}} + **Create output directories:** ```bash @@ -169,6 +174,59 @@ $B snapshot -D - **best-effort**: fix applied but couldn't fully verify (e.g., needs auth state, external service) - **reverted**: regression detected → `git revert HEAD` → mark issue as "deferred" +### 8e.5. Regression Test + +Skip if: classification is not "verified", OR the fix is purely visual/CSS with no JS behavior, OR no test framework was detected AND user declined bootstrap. + +**1. Study the project's existing test patterns:** + +Read 2-3 test files closest to the fix (same directory, same code type). Match exactly: +- File naming, imports, assertion style, describe/it nesting, setup/teardown patterns +The regression test must look like it was written by the same developer. + +**2. 
Trace the bug's codepath, then write a regression test:** + +Before writing the test, trace the data flow through the code you just fixed: +- What input/state triggered the bug? (the exact precondition) +- What codepath did it follow? (which branches, which function calls) +- Where did it break? (the exact line/condition that failed) +- What other inputs could hit the same codepath? (edge cases around the fix) + +The test MUST: +- Set up the precondition that triggered the bug (the exact state that made it break) +- Perform the action that exposed the bug +- Assert the correct behavior (NOT "it renders" or "it doesn't throw") +- If you found adjacent edge cases while tracing, test those too (e.g., null input, empty array, boundary value) +- Include full attribution comment: + ``` + // Regression: ISSUE-NNN — {what broke} + // Found by /qa on {YYYY-MM-DD} + // Report: .gstack/qa-reports/qa-report-{domain}-{date}.md + ``` + +Test type decision: +- Console error / JS exception / logic bug → unit or integration test +- Broken form / API failure / data flow bug → integration test with request/response +- Visual bug with JS behavior (broken dropdown, animation) → component test +- Pure CSS → skip (caught by QA reruns) + +Generate unit tests. Mock all external dependencies (DB, API, Redis, file system). + +Use auto-incrementing names to avoid collisions: check existing `{name}.regression-*.test.{ext}` files, take max number + 1. + +**3. Run only the new test file:** + +```bash +{detected test command} {new-test-file} +``` + +**4. Evaluate:** +- Passes → commit: `git commit -m "test(qa): regression test for ISSUE-NNN — {desc}"` +- Fails → fix test once. Still failing → delete test, defer. +- Taking >2 min exploration → skip and defer. + +**5. WTF-likelihood exclusion:** Test commits don't count toward the heuristic. + ### 8f. 
Self-Regulation (STOP AND EVALUATE)

 Every 5 fixes (or after any revert), compute the WTF-likelihood:
@@ -614,6 +823,6 @@ If the repo has a `TODOS.md`:

 11. **Clean working tree required.** Refuse to start if `git status --porcelain` is non-empty.
 12. **One commit per fix.** Never bundle multiple fixes into one commit.
-13. **Never modify tests or CI configuration.** Only fix application source code.
+13. **Only modify tests when generating regression tests in Phase 8e.5.** Never modify CI configuration. Never modify existing tests — only create new test files.
 14. **Revert on regression.** If a fix makes things worse, `git revert HEAD` immediately.
 15. **Self-regulate.** Follow the WTF-likelihood heuristic. When in doubt, stop and ask.
diff --git a/qa/templates/qa-report-template.md b/qa/templates/qa-report-template.md
index 5466bda42b4f7495dd301b3963d5ffcfde565873..6aa30943392b75ca8fb8694be31edf89881afd92 100644
--- a/qa/templates/qa-report-template.md
+++ b/qa/templates/qa-report-template.md
@@ -86,6 +86,22 @@

 ---

+## Regression Tests
+
+| Issue | Test File | Status | Description |
+|-------|-----------|--------|-------------|
+| ISSUE-NNN | path/to/test | committed / deferred / skipped | description |
+
+### Deferred Tests
+
+#### ISSUE-NNN: {title}
+**Precondition:** {setup state that triggers the bug}
+**Action:** {what the user does}
+**Expected:** {correct behavior}
+**Why deferred:** {reason}
+
+---
+
 ## Ship Readiness

 | Metric | Value |
diff --git a/retro/SKILL.md b/retro/SKILL.md
index c77815259d20bfe98968838858cca4107ddb248d..e7cd3d2c4296c923bfbd64bdc3e52f1dc0447077 100644
--- a/retro/SKILL.md
+++ b/retro/SKILL.md
@@ -164,6 +164,15 @@ cat ~/.gstack/greptile-history.md 2>/dev/null || true

 # 9. TODOS.md backlog (if available)
 cat TODOS.md 2>/dev/null || true
+
+# 10. Test file count
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' 2>/dev/null | grep -v node_modules | wc -l
+
+# 11.
Regression test commits in window
+git log origin/ --since="" --oneline --grep="test(qa):" --grep="test(design):" --grep="test: coverage"
+
+# 12. Test files changed in window
+git log origin/ --since="" --format="" --name-only | grep -E '\.(test|spec)\.' | sort -u | wc -l
 ```

 ### Step 2: Compute Metrics
@@ -185,6 +194,7 @@ Calculate and present these metrics in a summary table:
 | Detected sessions | N |
 | Avg LOC/session-hour | N |
 | Greptile signal | N% (Y catches, Z FPs) |
+| Test Health | N total tests · M added this period · K regression tests |

 Then show a **per-author leaderboard** immediately below:

@@ -408,7 +418,17 @@ Use the Write tool to save the JSON file with this schema:
 }
 ```

-**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. If either has no data, omit the field entirely.
+**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. Only include the `test_health` field if test files were found (command 10 returns > 0). If any has no data, omit the field entirely.
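The omission rule in the note above could be applied when assembling the snapshot roughly as follows. This is a minimal TypeScript sketch, not code from the repo: the `TestHealth` field names mirror the skill's `test_health` JSON example, while `RetroSnapshot` and `withTestHealth` are hypothetical names introduced only for illustration.

```typescript
// Field names follow the skill's test_health JSON example.
interface TestHealth {
  total_test_files: number;
  tests_added_this_period: number;
  regression_test_commits: number;
  test_files_changed: number;
}

// Hypothetical snapshot type; the other snapshot fields are left out here.
interface RetroSnapshot {
  test_health?: TestHealth;
}

// Attach test_health only when command 10 found at least one test file.
function withTestHealth(
  snapshot: RetroSnapshot,
  health: TestHealth,
): RetroSnapshot {
  if (health.total_test_files <= 0) {
    // No data: omit the field entirely instead of writing zeros.
    return snapshot;
  }
  return { ...snapshot, test_health: health };
}
```

Returning the snapshot untouched keeps the key out of the serialized JSON, which is what "omit the field entirely" asks for.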
+
+Include test health data in the JSON when test files exist:
+```json
+  "test_health": {
+    "total_test_files": 47,
+    "tests_added_this_period": 5,
+    "regression_test_commits": 3,
+    "test_files_changed": 8
+  }
+```

 Include backlog data in the JSON when TODOS.md exists:
 ```json
@@ -464,6 +484,13 @@ Narrative covering:
 - Any XL PRs that should have been split
 - Greptile signal ratio and trend (if history exists): "Greptile: X% signal (Y valid catches, Z false positives)"

+### Test Health
+- Total test files: N (from command 10)
+- Tests added this period: M (from command 12 — test files changed)
+- Regression test commits: list `test(qa):` and `test(design):` and `test: coverage` commits from command 11
+- If prior retro exists and has `test_health`: show delta "Test count: {last} → {now} (+{delta})"
+- If test ratio < 20%: flag as growth area — "100% test coverage is the goal. Tests make vibe coding safe."
+
 ### Focus & Highlights (from Step 8)
 - Focus score with interpretation
diff --git a/retro/SKILL.md.tmpl b/retro/SKILL.md.tmpl
index 2f39fb5c8ad1997489d6eda95ff98a0f3f0d9fff..bfbc2003bffcf5610bfdcae09bdbf829be250379 100644
--- a/retro/SKILL.md.tmpl
+++ b/retro/SKILL.md.tmpl
@@ -99,6 +99,15 @@ cat ~/.gstack/greptile-history.md 2>/dev/null || true

 # 9. TODOS.md backlog (if available)
 cat TODOS.md 2>/dev/null || true
+
+# 10. Test file count
+find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' 2>/dev/null | grep -v node_modules | wc -l
+
+# 11. Regression test commits in window
+git log origin/ --since="" --oneline --grep="test(qa):" --grep="test(design):" --grep="test: coverage"
+
+# 12. Test files changed in window
+git log origin/ --since="" --format="" --name-only | grep -E '\.(test|spec)\.'
| sort -u | wc -l
 ```

 ### Step 2: Compute Metrics
@@ -120,6 +129,7 @@ Calculate and present these metrics in a summary table:
 | Detected sessions | N |
 | Avg LOC/session-hour | N |
 | Greptile signal | N% (Y catches, Z FPs) |
+| Test Health | N total tests · M added this period · K regression tests |

 Then show a **per-author leaderboard** immediately below:

@@ -343,7 +353,17 @@ Use the Write tool to save the JSON file with this schema:
 }
 ```

-**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. If either has no data, omit the field entirely.
+**Note:** Only include the `greptile` field if `~/.gstack/greptile-history.md` exists and has entries within the time window. Only include the `backlog` field if `TODOS.md` exists. Only include the `test_health` field if test files were found (command 10 returns > 0). If any has no data, omit the field entirely.
+
+Include test health data in the JSON when test files exist:
+```json
+  "test_health": {
+    "total_test_files": 47,
+    "tests_added_this_period": 5,
+    "regression_test_commits": 3,
+    "test_files_changed": 8
+  }
+```

 Include backlog data in the JSON when TODOS.md exists:
 ```json
@@ -399,6 +419,13 @@ Narrative covering:
 - Any XL PRs that should have been split
 - Greptile signal ratio and trend (if history exists): "Greptile: X% signal (Y valid catches, Z false positives)"

+### Test Health
+- Total test files: N (from command 10)
+- Tests added this period: M (from command 12 — test files changed)
+- Regression test commits: list `test(qa):` and `test(design):` and `test: coverage` commits from command 11
+- If prior retro exists and has `test_health`: show delta "Test count: {last} → {now} (+{delta})"
+- If test ratio < 20%: flag as growth area — "100% test coverage is the goal. Tests make vibe coding safe."
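The delta line in the Test Health list above could be rendered along these lines. A minimal sketch under stated assumptions: `testCountDelta` is a hypothetical helper, a missing prior retro is modelled as `undefined`, and showing a shrinking count with a minus sign is an assumption, since the skill text only shows the `+{delta}` form.

```typescript
// Returns the delta line, or null when there is no prior retro with
// test_health data to compare against.
function testCountDelta(last: number | undefined, now: number): string | null {
  if (last === undefined) {
    return null;
  }
  const delta = now - last;
  // Assumption: a negative delta is shown with its own minus sign.
  const signed = delta >= 0 ? `+${delta}` : `${delta}`;
  return `Test count: ${last} → ${now} (${signed})`;
}

console.log(testCountDelta(42, 47)); // Test count: 42 → 47 (+5)
```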
+ ### Focus & Highlights (from Step 8) - Focus score with interpretation diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index ee8a1c097376c250bb79533dfbbbe98a47ac610e..31684e2175eb09cbefb823f72c7e6a846e2b4228 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -854,6 +854,161 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl - If \\\`skip_eng_review\\\` config is \\\`true\\\`, Eng Review shows "SKIPPED (global)" and verdict is CLEARED`; } +function generateTestBootstrap(): string { + return `## Test Framework Bootstrap + +**Detect existing test framework and project runtime:** + +\`\`\`bash +# Detect project runtime +[ -f Gemfile ] && echo "RUNTIME:ruby" +[ -f package.json ] && echo "RUNTIME:node" +[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python" +[ -f go.mod ] && echo "RUNTIME:go" +[ -f Cargo.toml ] && echo "RUNTIME:rust" +[ -f composer.json ] && echo "RUNTIME:php" +[ -f mix.exs ] && echo "RUNTIME:elixir" +# Detect sub-frameworks +[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails" +[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo "FRAMEWORK:nextjs" +# Check for existing test infrastructure +ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null +ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null +# Check opt-out marker +[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" +\`\`\` + +**If test framework detected** (config files or test directories found): +Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." +Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). +Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.** + +**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." 
**Skip the rest of bootstrap.**
+
+**If NO runtime detected** (no config files found): Use AskUserQuestion:
+"I couldn't detect your project's language. What runtime are you using?"
+Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests.
+If user picks H → write \`.gstack/no-test-bootstrap\` and continue without tests.
+
+**If runtime detected but no test framework — bootstrap:**
+
+### B2. Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- \`"[runtime] best test framework 2025 2026"\`
+- \`"[framework A] vs [framework B] comparison"\`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
+A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
+B) [Alternative] — [rationale]. Includes: [packages]
+C) Skip — don't set up testing right now
+RECOMMENDATION: Choose A because [reason based on project context]"
+
+If user picks C → write \`.gstack/no-test-bootstrap\`. Tell user: "If you change your mind later, delete \`.gstack/no-test-bootstrap\` and re-run." Continue without tests.
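The built-in knowledge table and the B3 option list could be modelled as plain data, roughly as below. This is an illustrative sketch only: the table values are copied from B2, but `FALLBACK_RECOMMENDATIONS`, `Recommendation` and `selectionOptions` are hypothetical names, and re-lettering the options when a runtime has no alternative is an assumption the skill text does not spell out.

```typescript
type Runtime =
  | "Ruby/Rails" | "Node.js" | "Next.js" | "Python"
  | "Go" | "Rust" | "PHP" | "Elixir";

interface Recommendation {
  primary: string;
  alternative: string | null; // null where the table has no alternative
}

// Values copied from the B2 fallback table.
const FALLBACK_RECOMMENDATIONS: Record<Runtime, Recommendation> = {
  "Ruby/Rails": { primary: "minitest + fixtures + capybara", alternative: "rspec + factory_bot + shoulda-matchers" },
  "Node.js": { primary: "vitest + @testing-library", alternative: "jest + @testing-library" },
  "Next.js": { primary: "vitest + @testing-library/react + playwright", alternative: "jest + cypress" },
  Python: { primary: "pytest + pytest-cov", alternative: "unittest" },
  Go: { primary: "stdlib testing + testify", alternative: "stdlib only" },
  Rust: { primary: "cargo test (built-in) + mockall", alternative: null },
  PHP: { primary: "phpunit + mockery", alternative: "pest" },
  Elixir: { primary: "ExUnit (built-in) + ex_machina", alternative: null },
};

// Build the lettered option list for the B3 question.
// Assumption: letters are assigned in order, so "Skip" becomes B)
// when the runtime has no alternative.
function selectionOptions(runtime: Runtime): string[] {
  const rec = FALLBACK_RECOMMENDATIONS[runtime];
  const labels: string[] = [rec.primary];
  if (rec.alternative !== null) {
    labels.push(rec.alternative);
  }
  labels.push("Skip (don't set up testing right now)");
  return labels.map(
    (label, index) => `${String.fromCharCode(65 + index)}) ${label}`,
  );
}
```

Keeping the table as typed data means a missing runtime entry fails at compile time rather than surfacing as an empty prompt.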
+ +If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially. + +### B4. Install and configure + +1. Install the chosen packages (npm/bun/gem/pip/etc.) +2. Create minimal config file +3. Create directory structure (test/, spec/, etc.) +4. Create one example test matching the project's code to verify setup works + +If package installation fails → debug once. If still failing → revert with \`git checkout -- package.json package-lock.json\` (or equivalent for the runtime). Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** \`git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10\` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never \`expect(x).toBeDefined()\` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +\`\`\`bash +# Run the full test suite to confirm everything works +{detected test command} +\`\`\` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +\`\`\`bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +\`\`\` + +If \`.github/\` exists (or no CI detected — default to GitHub Actions): +Create \`.github/workflows/test.yml\` with: +- \`runs-on: ubuntu-latest\` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) 
+- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a \`## Testing\` section → skip. Don't duplicate. + +Append a \`## Testing\` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +\`\`\`bash +git status --porcelain +\`\`\` + +Only commit if there are changes. 
Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created):
+\`git commit -m "chore: bootstrap test framework ({framework name})"\`
+
+---`;
+}
+
 const RESOLVERS: Record<string, () => string> = {
   COMMAND_REFERENCE: generateCommandReference,
   SNAPSHOT_FLAGS: generateSnapshotFlags,
@@ -863,6 +1018,7 @@ const RESOLVERS: Record<string, () => string> = {
   QA_METHODOLOGY: generateQAMethodology,
   DESIGN_METHODOLOGY: generateDesignMethodology,
   REVIEW_DASHBOARD: generateReviewDashboard,
+  TEST_BOOTSTRAP: generateTestBootstrap,
 };

 // ─── Template Processing ────────────────────────────────────
diff --git a/ship/SKILL.md b/ship/SKILL.md
index dc1a86a27887f59e8ba736a28ce83be22e6e93a8..32582088ea76b4b5041375f272287cc497e6532b 100644
--- a/ship/SKILL.md
+++ b/ship/SKILL.md
@@ -11,6 +11,7 @@ allowed-tools:
   - Grep
   - Glob
   - AskUserQuestion
+  - WebSearch
 ---

@@ -121,6 +122,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
 - Multi-file changesets (auto-split into bisectable commits)
 - TODOS.md completed-item detection (auto-mark)
 - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically)
+- Test coverage gaps (auto-generate and commit, or flag in PR body)

 ---

@@ -210,6 +212,163 @@ git fetch origin && git merge origin/ --no-edit

 ---

+## Step 2.5: Test Framework Bootstrap
+
+## Test Framework Bootstrap
+
+**Detect existing test framework and project runtime:**
+
+```bash
+# Detect project runtime
+[ -f Gemfile ] && echo "RUNTIME:ruby"
+[ -f package.json ] && echo "RUNTIME:node"
+[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
+[ -f go.mod ] && echo "RUNTIME:go"
+[ -f Cargo.toml ] && echo "RUNTIME:rust"
+[ -f composer.json ] && echo "RUNTIME:php"
+[ -f mix.exs ] && echo "RUNTIME:elixir"
+# Detect sub-frameworks
+[ -f Gemfile ] && grep -q "rails" Gemfile 2>/dev/null && echo "FRAMEWORK:rails"
+[ -f package.json ] && grep -q '"next"' package.json 2>/dev/null && echo
"FRAMEWORK:nextjs" +# Check for existing test infrastructure +ls jest.config.* vitest.config.* playwright.config.* .rspec pytest.ini pyproject.toml phpunit.xml 2>/dev/null +ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null +# Check opt-out marker +[ -f .gstack/no-test-bootstrap ] && echo "BOOTSTRAP_DECLINED" +``` + +**If test framework detected** (config files or test directories found): +Print "Test framework detected: {name} ({N} existing tests). Skipping bootstrap." +Read 2-3 existing test files to learn conventions (naming, imports, assertion style, setup patterns). +Store conventions as prose context for use in Phase 8e.5 or Step 3.4. **Skip the rest of bootstrap.** + +**If BOOTSTRAP_DECLINED** appears: Print "Test bootstrap previously declined — skipping." **Skip the rest of bootstrap.** + +**If NO runtime detected** (no config files found): Use AskUserQuestion: +"I couldn't detect your project's language. What runtime are you using?" +Options: A) Node.js/TypeScript B) Ruby/Rails C) Python D) Go E) Rust F) PHP G) Elixir H) This project doesn't need tests. +If user picks H → write `.gstack/no-test-bootstrap` and continue without tests. + +**If runtime detected but no test framework — bootstrap:** + +### B2. 
Research best practices
+
+Use WebSearch to find current best practices for the detected runtime:
+- `"[runtime] best test framework 2025 2026"`
+- `"[framework A] vs [framework B] comparison"`
+
+If WebSearch is unavailable, use this built-in knowledge table:
+
+| Runtime | Primary recommendation | Alternative |
+|---------|----------------------|-------------|
+| Ruby/Rails | minitest + fixtures + capybara | rspec + factory_bot + shoulda-matchers |
+| Node.js | vitest + @testing-library | jest + @testing-library |
+| Next.js | vitest + @testing-library/react + playwright | jest + cypress |
+| Python | pytest + pytest-cov | unittest |
+| Go | stdlib testing + testify | stdlib only |
+| Rust | cargo test (built-in) + mockall | — |
+| PHP | phpunit + mockery | pest |
+| Elixir | ExUnit (built-in) + ex_machina | — |
+
+### B3. Framework selection
+
+Use AskUserQuestion:
+"I detected this is a [Runtime/Framework] project with no test framework. I researched current best practices. Here are the options:
+A) [Primary] — [rationale]. Includes: [packages]. Supports: unit, integration, smoke, e2e
+B) [Alternative] — [rationale]. Includes: [packages]
+C) Skip — don't set up testing right now
+RECOMMENDATION: Choose A because [reason based on project context]"
+
+If user picks C → write `.gstack/no-test-bootstrap`. Tell user: "If you change your mind later, delete `.gstack/no-test-bootstrap` and re-run." Continue without tests.
+
+If multiple runtimes detected (monorepo) → ask which runtime to set up first, with option to do both sequentially.
+
+### B4. Install and configure
+
+1. Install the chosen packages (npm/bun/gem/pip/etc.)
+2. Create minimal config file
+3. Create directory structure (test/, spec/, etc.)
+4. Create one example test matching the project's code to verify setup works
+
+If package installation fails → debug once. If still failing → revert with `git checkout -- package.json package-lock.json` (or equivalent for the runtime).
Warn user and continue without tests. + +### B4.5. First real tests + +Generate 3-5 real tests for existing code: + +1. **Find recently changed files:** `git log --since=30.days --name-only --format="" | sort | uniq -c | sort -rn | head -10` +2. **Prioritize by risk:** Error handlers > business logic with conditionals > API endpoints > pure functions +3. **For each file:** Write one test that tests real behavior with meaningful assertions. Never `expect(x).toBeDefined()` — test what the code DOES. +4. Run each test. Passes → keep. Fails → fix once. Still fails → delete silently. +5. Generate at least 1 test, cap at 5. + +Never import secrets, API keys, or credentials in test files. Use environment variables or test fixtures. + +### B5. Verify + +```bash +# Run the full test suite to confirm everything works +{detected test command} +``` + +If tests fail → debug once. If still failing → revert all bootstrap changes and warn user. + +### B5.5. CI/CD pipeline + +```bash +# Check CI provider +ls -d .github/ 2>/dev/null && echo "CI:github" +ls .gitlab-ci.yml .circleci/ bitrise.yml 2>/dev/null +``` + +If `.github/` exists (or no CI detected — default to GitHub Actions): +Create `.github/workflows/test.yml` with: +- `runs-on: ubuntu-latest` +- Appropriate setup action for the runtime (setup-node, setup-ruby, setup-python, etc.) +- The same test command verified in B5 +- Trigger: push + pull_request + +If non-GitHub CI detected → skip CI generation with note: "Detected {provider} — CI pipeline generation supports GitHub Actions only. Add test step to your existing pipeline manually." + +### B6. Create TESTING.md + +First check: If TESTING.md already exists → read it and update/append rather than overwriting. Never destroy existing content. + +Write TESTING.md with: +- Philosophy: "100% test coverage is the key to great vibe coding. Tests let you move fast, trust your instincts, and ship with confidence — without them, vibe coding is just yolo coding. 
With tests, it's a superpower." +- Framework name and version +- How to run tests (the verified command from B5) +- Test layers: Unit tests (what, where, when), Integration tests, Smoke tests, E2E tests +- Conventions: file naming, assertion style, setup/teardown patterns + +### B7. Update CLAUDE.md + +First check: If CLAUDE.md already has a `## Testing` section → skip. Don't duplicate. + +Append a `## Testing` section: +- Run command and test directory +- Reference to TESTING.md +- Test expectations: + - 100% test coverage is the goal — tests make vibe coding safe + - When writing new functions, write a corresponding test + - When fixing a bug, write a regression test + - When adding error handling, write a test that triggers the error + - When adding a conditional (if/else, switch), write tests for BOTH paths + - Never commit code that makes existing tests fail + +### B8. Commit + +```bash +git status --porcelain +``` + +Only commit if there are changes. Stage all bootstrap files (config, test directory, TESTING.md, CLAUDE.md, .github/workflows/test.yml if created): +`git commit -m "chore: bootstrap test framework ({framework name})"` + +--- + +--- + ## Step 3: Run tests (on merged code) **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls @@ -294,6 +453,144 @@ If multiple suites need to run, run them sequentially (each needs a test lane). --- +## Step 3.4: Test Coverage Audit + +100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned. + +**0. Before/after test count:** + +```bash +# Count test files before any generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +Store this number for the PR body. + +**1. Trace every codepath changed** using `git diff origin/...HEAD`: + +Read every changed file. 
For each one, trace how data flows through the code — don't just list functions, actually follow the execution: + +1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context. +2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch: + - Where does input come from? (request params, props, database, API call) + - What transforms it? (validation, mapping, computation) + - Where does it go? (database write, API response, rendered output, side effect) + - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection) +3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing: + - Every function/method that was added or modified + - Every conditional branch (if/else, switch, ternary, guard clause, early return) + - Every error path (try/catch, rescue, error boundary, fallback) + - Every call to another function (trace into it — does IT have untested branches?) + - Every edge: what happens with null input? Empty array? Invalid type? + +This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test. + +**2. Map user flows, interactions, and error states:** + +Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through: + +- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test. +- **Interaction edge cases:** What happens when the user does something unexpected? 
+ - Double-click/rapid resubmit + - Navigate away mid-operation (back button, close tab, click another link) + - Submit with stale data (page sat open for 30 minutes, session expired) + - Slow connection (API takes 10 seconds — what does the user see?) + - Concurrent actions (two tabs, same form) +- **Error states the user can see:** For every error the code handles, what does the user actually experience? + - Is there a clear error message or a silent failure? + - Can the user recover (retry, go back, fix input) or are they stuck? + - What happens with no network? With a 500 from the API? With invalid data from the server? +- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input? + +Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else. + +**3. Check each branch against existing tests:** + +Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it: +- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb` +- An if/else → look for tests covering BOTH the true AND false path +- An error handler → look for a test that triggers that specific error condition +- A call to `helperFn()` that has its own branches → those branches need tests too +- A user flow → look for an integration or E2E test that walks through the journey +- An interaction edge case → look for a test that simulates the unexpected action + +Quality scoring rubric: +- ★★★ Tests behavior with edge cases AND error paths +- ★★ Tests correct behavior, happy path only +- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw") + +**4. 
Output ASCII coverage diagram:**
+
+Include BOTH code paths and user flows in the same diagram:
+
+```
+CODE PATH COVERAGE
+===========================
+[+] src/services/billing.ts
+    │
+    ├── processPayment()
+    │   ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
+    │   ├── [GAP] Network timeout — NO TEST
+    │   └── [GAP] Invalid currency — NO TEST
+    │
+    └── refundPayment()
+        ├── [★★ TESTED] Full refund — billing.test.ts:89
+        └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101
+
+USER FLOW COVERAGE
+===========================
+[+] Payment checkout flow
+    │
+    ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
+    ├── [GAP] Double-click submit — NO TEST
+    ├── [GAP] Navigate away during payment — NO TEST
+    └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
+
+[+] Error states
+    │
+    ├── [★★ TESTED] Card declined message — billing.test.ts:58
+    ├── [GAP] Network timeout UX (what does user see?) — NO TEST
+    └── [GAP] Empty cart submission — NO TEST
+
+─────────────────────────────────
+COVERAGE: 6/12 paths tested (50%)
+  Code paths: 3/5 (60%)
+  User flows: 3/7 (43%)
+QUALITY: ★★★: 2 ★★: 2 ★: 2
+GAPS: 6 paths need tests
+─────────────────────────────────
+```
+
+**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue.
+
+**5. Generate tests for uncovered paths:**
+
+If test framework detected (or bootstrapped in Step 2.5):
+- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
+- Read 2-3 existing test files to match conventions exactly
+- Generate unit tests. Mock all external dependencies (DB, API, Redis).
+- Write tests that exercise the specific uncovered path with real assertions
+- Run each test. Passes → commit as `test: coverage for {feature}`
+- Fails → fix once. Still fails → revert, note gap in diagram.
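The summary block that closes the coverage diagram is plain arithmetic over the diagram's entries, and could be derived roughly as below. This is a sketch with hypothetical names (`PathEntry`, `summarize`, `percent`), and it assumes percentages are rounded to the nearest whole number; the skill itself does not prescribe how the totals are computed.

```typescript
// Star rating for one diagram entry; 0 marks a [GAP] (no test found).
type Stars = 0 | 1 | 2 | 3;

interface PathEntry {
  kind: "code" | "flow";
  stars: Stars;
}

interface CoverageSummary {
  tested: number;
  total: number;
  percent: number;
  codePercent: number;
  flowPercent: number;
  quality: Record<1 | 2 | 3, number>;
  gaps: number;
}

// Whole-number percentage; an empty group counts as 0% instead of NaN.
function percent(tested: number, total: number): number {
  return total === 0 ? 0 : Math.round((tested / total) * 100);
}

function summarize(entries: PathEntry[]): CoverageSummary {
  const quality: Record<1 | 2 | 3, number> = { 1: 0, 2: 0, 3: 0 };
  const tally = {
    code: { tested: 0, total: 0 },
    flow: { tested: 0, total: 0 },
  };
  for (const entry of entries) {
    tally[entry.kind].total += 1;
    if (entry.stars !== 0) {
      tally[entry.kind].tested += 1;
      quality[entry.stars] += 1;
    }
  }
  const tested = tally.code.tested + tally.flow.tested;
  const total = tally.code.total + tally.flow.total;
  return {
    tested,
    total,
    percent: percent(tested, total),
    codePercent: percent(tally.code.tested, tally.code.total),
    flowPercent: percent(tally.flow.tested, tally.flow.total),
    quality,
    gaps: total - tested,
  };
}

const summary = summarize([
  { kind: "code", stars: 3 },
  { kind: "code", stars: 0 },
  { kind: "flow", stars: 2 },
  { kind: "flow", stars: 0 },
  { kind: "flow", stars: 0 },
]);
// prints "COVERAGE: 2/5 paths tested (40%)"
console.log(
  `COVERAGE: ${summary.tested}/${summary.total} paths tested (${summary.percent}%)`,
);
```

Deriving the totals from the entries, rather than typing them by hand, keeps the summary lines consistent with the rows above them.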
+ +Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap. + +If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured." + +**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit." + +**6. After-count and coverage summary:** + +```bash +# Count test files after generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +For PR body: `Tests: {before} → {after} (+{delta} new)` +Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.` + +--- + ## Step 3.5: Pre-Landing Review Review the diff for structural issues that tests don't catch. @@ -522,6 +819,10 @@ gh pr create --base --title ": " --body "$(cat <<'EOF' ## Summary +## Test Coverage + + + ## Pre-Landing Review @@ -563,4 +864,5 @@ EOF - **Split commits for bisectability** — each commit = one logical change. - **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done. - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies. +- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests. - **The goal is: user says `/ship`, next thing they see is the review + PR URL.** diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index bb6db1583c7917a82434c30d336166b68419756c..e059fc6ac90c3a607a2dd0a5a329fd53af5cdedb 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -11,6 +11,7 @@ allowed-tools: - Grep - Glob - AskUserQuestion + - WebSearch --- {{PREAMBLE}} @@ -39,6 +40,7 @@ You are running the `/ship` workflow. 
This is a **non-interactive, fully automat - Multi-file changesets (auto-split into bisectable commits) - TODOS.md completed-item detection (auto-mark) - Auto-fixable review findings (dead code, N+1, stale comments — fixed automatically) +- Test coverage gaps (auto-generate and commit, or flag in PR body) --- @@ -92,6 +94,12 @@ git fetch origin && git merge origin/ --no-edit --- +## Step 2.5: Test Framework Bootstrap + +{{TEST_BOOTSTRAP}} + +--- + ## Step 3: Run tests (on merged code) **Do NOT run `RAILS_ENV=test bin/rails db:migrate`** — `bin/test-lane` already calls @@ -176,6 +184,144 @@ If multiple suites need to run, run them sequentially (each needs a test lane). --- +## Step 3.4: Test Coverage Audit + +100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned. + +**0. Before/after test count:** + +```bash +# Count test files before any generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +Store this number for the PR body. + +**1. Trace every codepath changed** using `git diff origin/...HEAD`: + +Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution: + +1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context. +2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch: + - Where does input come from? (request params, props, database, API call) + - What transforms it? (validation, mapping, computation) + - Where does it go? (database write, API response, rendered output, side effect) + - What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection) +3. 
**Diagram the execution.** For each changed file, draw an ASCII diagram showing: + - Every function/method that was added or modified + - Every conditional branch (if/else, switch, ternary, guard clause, early return) + - Every error path (try/catch, rescue, error boundary, fallback) + - Every call to another function (trace into it — does IT have untested branches?) + - Every edge: what happens with null input? Empty array? Invalid type? + +This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test. + +**2. Map user flows, interactions, and error states:** + +Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through: + +- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test. +- **Interaction edge cases:** What happens when the user does something unexpected? + - Double-click/rapid resubmit + - Navigate away mid-operation (back button, close tab, click another link) + - Submit with stale data (page sat open for 30 minutes, session expired) + - Slow connection (API takes 10 seconds — what does the user see?) + - Concurrent actions (two tabs, same form) +- **Error states the user can see:** For every error the code handles, what does the user actually experience? + - Is there a clear error message or a silent failure? + - Can the user recover (retry, go back, fix input) or are they stuck? + - What happens with no network? With a 500 from the API? With invalid data from the server? +- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input? + +Add these to your diagram alongside the code branches. 
A user flow with no test is just as much a gap as an untested if/else. + +**3. Check each branch against existing tests:** + +Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it: +- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb` +- An if/else → look for tests covering BOTH the true AND false path +- An error handler → look for a test that triggers that specific error condition +- A call to `helperFn()` that has its own branches → those branches need tests too +- A user flow → look for an integration or E2E test that walks through the journey +- An interaction edge case → look for a test that simulates the unexpected action + +Quality scoring rubric: +- ★★★ Tests behavior with edge cases AND error paths +- ★★ Tests correct behavior, happy path only +- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw") + +**4. Output ASCII coverage diagram:** + +Include BOTH code paths and user flows in the same diagram: + +``` +CODE PATH COVERAGE +=========================== +[+] src/services/billing.ts + │ + ├── processPayment() + │ ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42 + │ ├── [GAP] Network timeout — NO TEST + │ └── [GAP] Invalid currency — NO TEST + │ + └── refundPayment() + ├── [★★ TESTED] Full refund — billing.test.ts:89 + └── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101 + +USER FLOW COVERAGE +=========================== +[+] Payment checkout flow + │ + ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15 + ├── [GAP] Double-click submit — NO TEST + ├── [GAP] Navigate away during payment — NO TEST + └── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40 + +[+] Error states + │ + ├── [★★ TESTED] Card declined message — billing.test.ts:58 + ├── [GAP] Network timeout UX (what does user see?) 
— NO TEST + └── [GAP] Empty cart submission — NO TEST + +───────────────────────────────── +COVERAGE: 6/12 paths tested (50%) + Code paths: 3/5 (60%) + User flows: 3/7 (43%) +QUALITY: ★★★: 2 ★★: 2 ★: 2 +GAPS: 6 paths need tests +───────────────────────────────── +``` + +**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue. + +**5. Generate tests for uncovered paths:** + +If test framework detected (or bootstrapped in Step 2.5): +- Prioritize error handlers and edge cases first (happy paths are more likely already tested) +- Read 2-3 existing test files to match conventions exactly +- Generate unit tests. Mock all external dependencies (DB, API, Redis). +- Write tests that exercise the specific uncovered path with real assertions +- Run each test. Passes → commit as `test: coverage for {feature}` +- Fails → fix once. Still fails → revert, note gap in diagram. + +Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap. + +If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured." + +**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit." + +**6. After-count and coverage summary:** + +```bash +# Count test files after generation +find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l +``` + +For PR body: `Tests: {before} → {after} (+{delta} new)` +Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.` + +--- + ## Step 3.5: Pre-Landing Review Review the diff for structural issues that tests don't catch. @@ -404,6 +550,10 @@ gh pr create --base --title ": " --body "$(cat <<'EOF' ## Summary +## Test Coverage + + + ## Pre-Landing Review @@ -445,4 +595,5 @@ EOF - **Split commits for bisectability** — each commit = one logical change. 
- **TODOS.md completion detection must be conservative.** Only mark items as completed when the diff clearly shows the work is done. - **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence (inline diff, code references, re-rank suggestion). Never post vague replies. +- **Step 3.4 generates coverage tests.** They must pass before committing. Never commit failing tests. - **The goal is: user says `/ship`, next thing they see is the review + PR URL.** diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts index e50f688e6a09cfe2312f32e423c116c5cead6cec..2ea56da9eab34250f1685b312cd39fbc595beb56 100644 --- a/test/skill-e2e.test.ts +++ b/test/skill-e2e.test.ts @@ -2298,6 +2298,269 @@ Review the site at ${serverUrl}. Use --quick mode. Skip any AskUserQuestion call }, 420_000); }); +// --- Test Bootstrap E2E --- + +describeE2E('Test Bootstrap E2E', () => { + let bootstrapDir: string; + let bootstrapServer: ReturnType; + + beforeAll(() => { + bootstrapDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-bootstrap-')); + setupBrowseShims(bootstrapDir); + + // Copy qa skill files + copyDirSync(path.join(ROOT, 'qa'), path.join(bootstrapDir, 'qa')); + + // Create a minimal Node.js project with NO test framework + fs.writeFileSync(path.join(bootstrapDir, 'package.json'), JSON.stringify({ + name: 'test-bootstrap-app', + version: '1.0.0', + type: 'module', + }, null, 2)); + + // Create a simple app file with a bug + fs.writeFileSync(path.join(bootstrapDir, 'app.js'), ` +export function add(a, b) { return a + b; } +export function subtract(a, b) { return a - b; } +export function divide(a, b) { return a / b; } // BUG: no zero check +`); + + // Create a simple HTML page with a bug + fs.writeFileSync(path.join(bootstrapDir, 'index.html'), ` + +Bootstrap Test + +

Test App

+ Broken Link + + + +`); + + // Init git repo + const run = (cmd: string, args: string[]) => + spawnSync(cmd, args, { cwd: bootstrapDir, stdio: 'pipe', timeout: 5000 }); + run('git', ['init']); + run('git', ['config', 'user.email', 'test@test.com']); + run('git', ['config', 'user.name', 'Test']); + run('git', ['add', '.']); + run('git', ['commit', '-m', 'initial commit']); + + // Serve from working directory + bootstrapServer = Bun.serve({ + port: 0, + hostname: '127.0.0.1', + fetch(req) { + const url = new URL(req.url); + let filePath = url.pathname === '/' ? '/index.html' : url.pathname; + filePath = filePath.replace(/^\//, ''); + const fullPath = path.join(bootstrapDir, filePath); + if (!fs.existsSync(fullPath)) { + return new Response('Not Found', { status: 404 }); + } + const content = fs.readFileSync(fullPath, 'utf-8'); + return new Response(content, { + headers: { 'Content-Type': 'text/html' }, + }); + }, + }); + }); + + afterAll(() => { + bootstrapServer?.stop(); + try { fs.rmSync(bootstrapDir, { recursive: true, force: true }); } catch {} + }); + + test('/qa bootstrap + regression test on zero-test project', async () => { + const serverUrl = `http://127.0.0.1:${bootstrapServer!.port}`; + + const result = await runSkillTest({ + prompt: `You have a browse binary at ${browseBin}. Assign it to B variable like: B="${browseBin}" + +Read the file qa/SKILL.md for the QA workflow instructions. + +Run a Quick-tier QA test on ${serverUrl} +The source code for this page is at ${bootstrapDir}/index.html — you can fix bugs there. +Do NOT use AskUserQuestion — for any AskUserQuestion prompts, choose the RECOMMENDED option automatically. +Write your report to ${bootstrapDir}/qa-reports/qa-report.md + +This project has NO test framework. When the bootstrap asks, pick vitest (option A). 
+This is a test+fix loop: find bugs, fix them, write regression tests, commit each fix.`, + workingDirectory: bootstrapDir, + maxTurns: 50, + allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'], + timeout: 420_000, + testName: 'qa-bootstrap', + runId, + }); + + logCost('/qa bootstrap', result); + recordE2E('/qa bootstrap + regression test', 'Test Bootstrap E2E', result, { + passed: ['success', 'error_max_turns'].includes(result.exitReason), + }); + + expect(['success', 'error_max_turns']).toContain(result.exitReason); + + // Verify bootstrap created test infrastructure + const hasTestConfig = fs.existsSync(path.join(bootstrapDir, 'vitest.config.ts')) + || fs.existsSync(path.join(bootstrapDir, 'vitest.config.js')) + || fs.existsSync(path.join(bootstrapDir, 'jest.config.js')) + || fs.existsSync(path.join(bootstrapDir, 'jest.config.ts')); + console.log(`Test config created: ${hasTestConfig}`); + + const hasTestingMd = fs.existsSync(path.join(bootstrapDir, 'TESTING.md')); + console.log(`TESTING.md created: ${hasTestingMd}`); + + // Check for bootstrap commit + const gitLog = spawnSync('git', ['log', '--oneline', '--grep=bootstrap'], { + cwd: bootstrapDir, stdio: 'pipe', + }); + const bootstrapCommits = gitLog.stdout.toString().trim(); + console.log(`Bootstrap commits: ${bootstrapCommits || 'none'}`); + + // Check for regression test commits + const regressionLog = spawnSync('git', ['log', '--oneline', '--grep=test(qa)'], { + cwd: bootstrapDir, stdio: 'pipe', + }); + const regressionCommits = regressionLog.stdout.toString().trim(); + console.log(`Regression test commits: ${regressionCommits || 'none'}`); + + // Verify at least the bootstrap happened (fix commits are bonus) + const allCommits = spawnSync('git', ['log', '--oneline'], { + cwd: bootstrapDir, stdio: 'pipe', + }); + const totalCommits = allCommits.stdout.toString().trim().split('\n').length; + console.log(`Total commits: ${totalCommits}`); + expect(totalCommits).toBeGreaterThan(1); // At least 
initial + bootstrap + }, 420_000); +}); + +// --- Test Coverage Audit E2E --- + +describeE2E('Test Coverage Audit E2E', () => { + let coverageDir: string; + + beforeAll(() => { + coverageDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-coverage-')); + + // Copy ship skill files + copyDirSync(path.join(ROOT, 'ship'), path.join(coverageDir, 'ship')); + copyDirSync(path.join(ROOT, 'review'), path.join(coverageDir, 'review')); + + // Create a Node.js project WITH test framework but coverage gaps + fs.writeFileSync(path.join(coverageDir, 'package.json'), JSON.stringify({ + name: 'test-coverage-app', + version: '1.0.0', + type: 'module', + scripts: { test: 'echo "no tests yet"' }, + devDependencies: { vitest: '^1.0.0' }, + }, null, 2)); + + // Create vitest config + fs.writeFileSync(path.join(coverageDir, 'vitest.config.ts'), + `import { defineConfig } from 'vitest/config';\nexport default defineConfig({ test: {} });\n`); + + fs.writeFileSync(path.join(coverageDir, 'VERSION'), '0.1.0.0\n'); + fs.writeFileSync(path.join(coverageDir, 'CHANGELOG.md'), '# Changelog\n'); + + // Create source file with multiple code paths + fs.mkdirSync(path.join(coverageDir, 'src'), { recursive: true }); + fs.writeFileSync(path.join(coverageDir, 'src', 'billing.ts'), ` +export function processPayment(amount: number, currency: string) { + if (amount <= 0) throw new Error('Invalid amount'); + if (currency !== 'USD' && currency !== 'EUR') throw new Error('Unsupported currency'); + return { status: 'success', amount, currency }; +} + +export function refundPayment(paymentId: string, reason: string) { + if (!paymentId) throw new Error('Payment ID required'); + if (!reason) throw new Error('Reason required'); + return { status: 'refunded', paymentId, reason }; +} +`); + + // Create a test directory with ONE test (partial coverage) + fs.mkdirSync(path.join(coverageDir, 'test'), { recursive: true }); + fs.writeFileSync(path.join(coverageDir, 'test', 'billing.test.ts'), ` +import { describe, 
test, expect } from 'vitest'; +import { processPayment } from '../src/billing'; + +describe('processPayment', () => { + test('processes valid payment', () => { + const result = processPayment(100, 'USD'); + expect(result.status).toBe('success'); + }); + // GAP: no test for invalid amount + // GAP: no test for unsupported currency + // GAP: refundPayment not tested at all +}); +`); + + // Init git repo with main branch + const run = (cmd: string, args: string[]) => + spawnSync(cmd, args, { cwd: coverageDir, stdio: 'pipe', timeout: 5000 }); + run('git', ['init', '-b', 'main']); + run('git', ['config', 'user.email', 'test@test.com']); + run('git', ['config', 'user.name', 'Test']); + run('git', ['add', '.']); + run('git', ['commit', '-m', 'initial commit']); + + // Create feature branch + run('git', ['checkout', '-b', 'feature/billing']); + }); + + afterAll(() => { + try { fs.rmSync(coverageDir, { recursive: true, force: true }); } catch {} + }); + + test('/ship Step 3.4 produces coverage diagram', async () => { + const result = await runSkillTest({ + prompt: `Read the file ship/SKILL.md for the ship workflow instructions. + +You are on the feature/billing branch. The base branch is main. +This is a test project — there is no remote, no PR to create. + +ONLY run Step 3.4 (Test Coverage Audit) from the ship workflow. +Skip all other steps (tests, evals, review, version, changelog, commit, push, PR). + +The source code is in ${coverageDir}/src/billing.ts. +Existing tests are in ${coverageDir}/test/billing.test.ts. +The test command is: echo "tests pass" (mocked — just pretend tests pass). + +Produce the ASCII coverage diagram showing which code paths are tested and which have gaps. +Do NOT generate new tests — just produce the diagram and coverage summary. 
+Output the diagram directly.`, + workingDirectory: coverageDir, + maxTurns: 15, + allowedTools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep'], + timeout: 120_000, + testName: 'ship-coverage-audit', + runId, + }); + + logCost('/ship coverage audit', result); + recordE2E('/ship Step 3.4 coverage audit', 'Test Coverage Audit E2E', result, { + passed: result.exitReason === 'success', + }); + + expect(result.exitReason).toBe('success'); + + // Check output contains coverage diagram elements + const output = result.output || ''; + const hasGap = output.includes('GAP') || output.includes('gap') || output.includes('NO TEST'); + const hasTested = output.includes('TESTED') || output.includes('tested') || output.includes('✓'); + const hasCoverage = output.includes('COVERAGE') || output.includes('coverage') || output.includes('paths tested'); + + console.log(`Output has GAP markers: ${hasGap}`); + console.log(`Output has TESTED markers: ${hasTested}`); + console.log(`Output has coverage summary: ${hasCoverage}`); + + // At minimum, the agent should have read the source and test files + const readCalls = result.toolCalls.filter(tc => tc.tool === 'Read'); + expect(readCalls.length).toBeGreaterThan(0); + }, 180_000); +}); + // Module-level afterAll — finalize eval collector after all tests complete afterAll(async () => { if (evalCollector) { diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index 4231a91df463f3bdb2eaa70f56b92735e293e50f..54e03a4d3942424d0febcd3ea6022cc531851e12 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -737,3 +737,225 @@ describe('gstack-slug', () => { expect(lines[1]).toMatch(/^BRANCH=.+/); }); }); + +// --- Test Bootstrap validation --- + +describe('Test Bootstrap ({{TEST_BOOTSTRAP}}) integration', () => { + test('TEST_BOOTSTRAP resolver produces valid content', () => { + const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(qaContent).toContain('Test 
Framework Bootstrap'); + expect(qaContent).toContain('RUNTIME:ruby'); + expect(qaContent).toContain('RUNTIME:node'); + expect(qaContent).toContain('RUNTIME:python'); + expect(qaContent).toContain('no-test-bootstrap'); + expect(qaContent).toContain('BOOTSTRAP_DECLINED'); + }); + + test('TEST_BOOTSTRAP appears in qa/SKILL.md', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Test Framework Bootstrap'); + expect(content).toContain('TESTING.md'); + expect(content).toContain('CLAUDE.md'); + }); + + test('TEST_BOOTSTRAP appears in ship/SKILL.md', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Test Framework Bootstrap'); + expect(content).toContain('Step 2.5'); + }); + + test('TEST_BOOTSTRAP appears in qa-design-review/SKILL.md', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Test Framework Bootstrap'); + }); + + test('TEST_BOOTSTRAP does NOT appear in qa-only/SKILL.md', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa-only', 'SKILL.md'), 'utf-8'); + expect(content).not.toContain('Test Framework Bootstrap'); + // But should have the recommendation note + expect(content).toContain('No test framework detected'); + expect(content).toContain('Run `/qa` to bootstrap'); + }); + + test('bootstrap includes framework knowledge table', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('vitest'); + expect(content).toContain('minitest'); + expect(content).toContain('pytest'); + expect(content).toContain('cargo test'); + expect(content).toContain('phpunit'); + expect(content).toContain('ExUnit'); + }); + + test('bootstrap includes CI/CD pipeline generation', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + 
expect(content).toContain('.github/workflows/test.yml'); + expect(content).toContain('GitHub Actions'); + }); + + test('bootstrap includes first real tests step', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('First real tests'); + expect(content).toContain('git log --since=30.days'); + expect(content).toContain('Prioritize by risk'); + }); + + test('bootstrap includes vibe coding philosophy', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('vibe coding'); + expect(content).toContain('100% test coverage'); + }); + + test('WebSearch is in allowed-tools for qa, ship, qa-design-review', () => { + const qa = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + const ship = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + const qaDesign = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8'); + expect(qa).toContain('WebSearch'); + expect(ship).toContain('WebSearch'); + expect(qaDesign).toContain('WebSearch'); + }); +}); + +// --- Phase 8e.5 regression test validation --- + +describe('Phase 8e.5 regression test generation', () => { + test('qa/SKILL.md contains Phase 8e.5', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('8e.5. 
Regression Test'); + expect(content).toContain('test(qa): regression test'); + expect(content).toContain('WTF-likelihood exclusion'); + }); + + test('qa/SKILL.md Rule 13 is amended for regression tests', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Only modify tests when generating regression tests in Phase 8e.5'); + expect(content).not.toContain('Never modify tests or CI configuration'); + }); + + test('qa-design-review has CSS-aware Phase 8e.5 variant', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa-design-review', 'SKILL.md'), 'utf-8'); + expect(content).toContain('8e.5. Regression Test (design-review variant)'); + expect(content).toContain('CSS-only'); + expect(content).toContain('test(design): regression test'); + }); + + test('regression test includes full attribution comment format', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('// Regression: ISSUE-NNN'); + expect(content).toContain('// Found by /qa on'); + expect(content).toContain('// Report: .gstack/qa-reports/'); + }); + + test('regression test uses auto-incrementing names', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(content).toContain('auto-incrementing'); + expect(content).toContain('max number + 1'); + }); +}); + +// --- Step 3.4 coverage audit validation --- + +describe('Step 3.4 test coverage audit', () => { + test('ship/SKILL.md contains Step 3.4', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Step 3.4: Test Coverage Audit'); + expect(content).toContain('CODE PATH COVERAGE'); + }); + + test('Step 3.4 includes quality scoring rubric', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('★★★'); + expect(content).toContain('★★'); + 
expect(content).toContain('edge cases AND error paths'); + expect(content).toContain('happy path only'); + }); + + test('Step 3.4 includes before/after test count', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Count test files before'); + expect(content).toContain('Count test files after'); + }); + + test('ship PR body includes Test Coverage section', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('## Test Coverage'); + }); + + test('ship rules include test generation rule', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Step 3.4 generates coverage tests'); + expect(content).toContain('Never commit failing tests'); + }); + + test('Step 3.4 includes vibe coding philosophy', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('vibe coding becomes yolo coding'); + }); + + test('Step 3.4 traces actual codepaths, not just syntax', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Trace every codepath'); + expect(content).toContain('Trace data flow'); + expect(content).toContain('Diagram the execution'); + }); + + test('Step 3.4 maps user flows and interaction edge cases', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Map user flows'); + expect(content).toContain('Interaction edge cases'); + expect(content).toContain('Double-click'); + expect(content).toContain('Navigate away'); + expect(content).toContain('Error states the user can see'); + expect(content).toContain('Empty/zero/boundary states'); + }); + + test('Step 3.4 diagram includes USER FLOW COVERAGE section', () => { + const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8'); + 
expect(content).toContain('USER FLOW COVERAGE'); + expect(content).toContain('Code paths:'); + expect(content).toContain('User flows:'); + }); +}); + +// --- Retro test health validation --- + +describe('Retro test health tracking', () => { + test('retro/SKILL.md has test health data gathering commands', () => { + const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8'); + expect(content).toContain('# 10. Test file count'); + expect(content).toContain('# 11. Regression test commits'); + expect(content).toContain('# 12. Test files changed'); + }); + + test('retro/SKILL.md has Test Health metrics row', () => { + const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8'); + expect(content).toContain('Test Health'); + expect(content).toContain('regression tests'); + }); + + test('retro/SKILL.md has Test Health narrative section', () => { + const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8'); + expect(content).toContain('### Test Health'); + expect(content).toContain('Total test files'); + expect(content).toContain('vibe coding safe'); + }); + + test('retro JSON schema includes test_health field', () => { + const content = fs.readFileSync(path.join(ROOT, 'retro', 'SKILL.md'), 'utf-8'); + expect(content).toContain('test_health'); + expect(content).toContain('total_test_files'); + expect(content).toContain('regression_test_commits'); + }); +}); + +// --- QA report template regression tests section --- + +describe('QA report template', () => { + test('qa-report-template.md has Regression Tests section', () => { + const content = fs.readFileSync(path.join(ROOT, 'qa', 'templates', 'qa-report-template.md'), 'utf-8'); + expect(content).toContain('## Regression Tests'); + expect(content).toContain('committed / deferred / skipped'); + expect(content).toContain('### Deferred Tests'); + expect(content).toContain('**Precondition:**'); + }); +});
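
The summary block at the bottom of the Step 3.4 diagram (COVERAGE, QUALITY, GAPS) is plain arithmetic over the diagram entries, so it can be tallied mechanically rather than by eye. The following is a minimal sketch of that tally, not code from this patch: the `PathEntry` type, the `summarize` and `formatSummary` helpers, and the convention that rating `0` means a gap are all hypothetical names chosen for illustration. The sample entries mirror the billing example in the diagram.

```typescript
// Hypothetical helper types; nothing here exists in gstack itself.
// Rating 0 marks a [GAP]; 1 to 3 are the star ratings from the rubric.
type Rating = 0 | 1 | 2 | 3;

interface PathEntry {
  kind: "code" | "flow"; // CODE PATH COVERAGE vs USER FLOW COVERAGE
  label: string;
  rating: Rating;
}

interface CoverageSummary {
  tested: number;
  total: number;
  codeTested: number;
  codeTotal: number;
  flowTested: number;
  flowTotal: number;
  stars: { three: number; two: number; one: number };
  gaps: number;
}

// Whole-number percentage. Assumption: an empty set counts as fully covered.
function pct(n: number, d: number): number {
  return d === 0 ? 100 : Math.round((n / d) * 100);
}

function summarize(entries: PathEntry[]): CoverageSummary {
  const isTested = (e: PathEntry) => e.rating > 0;
  const code = entries.filter((e) => e.kind === "code");
  const flow = entries.filter((e) => e.kind === "flow");
  const count = (r: Rating) => entries.filter((e) => e.rating === r).length;
  return {
    tested: entries.filter(isTested).length,
    total: entries.length,
    codeTested: code.filter(isTested).length,
    codeTotal: code.length,
    flowTested: flow.filter(isTested).length,
    flowTotal: flow.length,
    stars: { three: count(3), two: count(2), one: count(1) },
    gaps: count(0),
  };
}

// Render the summary in the same layout as the diagram footer.
function formatSummary(s: CoverageSummary): string {
  return [
    `COVERAGE: ${s.tested}/${s.total} paths tested (${pct(s.tested, s.total)}%)`,
    `  Code paths: ${s.codeTested}/${s.codeTotal} (${pct(s.codeTested, s.codeTotal)}%)`,
    `  User flows: ${s.flowTested}/${s.flowTotal} (${pct(s.flowTested, s.flowTotal)}%)`,
    `QUALITY: ★★★: ${s.stars.three} ★★: ${s.stars.two} ★: ${s.stars.one}`,
    `GAPS: ${s.gaps} paths need tests`,
  ].join("\n");
}

// The twelve entries of the billing example diagram, in order.
const diagramEntries: PathEntry[] = [
  { kind: "code", label: "processPayment: happy path + card declined + timeout", rating: 3 },
  { kind: "code", label: "processPayment: network timeout", rating: 0 },
  { kind: "code", label: "processPayment: invalid currency", rating: 0 },
  { kind: "code", label: "refundPayment: full refund", rating: 2 },
  { kind: "code", label: "refundPayment: partial refund", rating: 1 },
  { kind: "flow", label: "checkout: complete purchase", rating: 3 },
  { kind: "flow", label: "checkout: double-click submit", rating: 0 },
  { kind: "flow", label: "checkout: navigate away during payment", rating: 0 },
  { kind: "flow", label: "checkout: form validation errors", rating: 1 },
  { kind: "flow", label: "errors: card declined message", rating: 2 },
  { kind: "flow", label: "errors: network timeout UX", rating: 0 },
  { kind: "flow", label: "errors: empty cart submission", rating: 0 },
];

console.log(formatSummary(summarize(diagramEntries)));
```

Keeping the entries as data and deriving the footer from them means the totals cannot drift from the tree above them when an entry is added or re-rated.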