~cytrogen/gstack

ref: bd834aeadb284a41b3be451d2769fb047b45f7a0 gstack/TODOS.md -rw-r--r-- 22.0 KiB
c0f3c3a9 — Garry Tan a month ago
fix: security hardening + issue triage (v0.8.3) (#205)

* fix: check for bun before running setup (#147)

Users without bun installed got a cryptic "command not found" error.
Now prints a clear message with install instructions.

Closes #147

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: block SSRF via URL validation in browse commands (#17)

Adds validateNavigationUrl() that blocks non-HTTP(S) schemes (file://,
javascript:, data:) and cloud metadata endpoints (169.254.169.254,
metadata.google.internal). Applied to goto, diff, and newTab commands.
Localhost and private IPs remain allowed for local dev QA.

Closes #17

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace eval $(gstack-slug) with source <(...) (#133)

Eliminates unnecessary use of eval across all skill templates and
generated files. source <(...) has identical behavior without the
shell injection surface. Also hardens gstack-diff-scope usage.

Closes #133

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename /debug to /investigate to avoid Claude Code conflict (#190)

Claude Code has a built-in /debug command that shadows the gstack skill.
Renaming to /investigate which better reflects the systematic root-cause
investigation methodology.

Closes #190

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for path validation helpers

validateOutputPath() and validateReadPath() are security-critical
functions with zero test coverage. Adds 14 tests covering safe paths,
traversal attacks, and prefix collision edge cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.8.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update /debug → /investigate references in docs

CLAUDE.md, README.md, and docs/skills.md still referenced the old
/debug skill name after the rename.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden URL validation against hostname bypasses (Codex P1)

Codex review found that metadata IPs could be reached via hex
(0xA9FEA9FE), decimal (2852039166), octal, trailing dot, and IPv6
bracket forms. Now normalizes hostnames before checking the blocklist
and probes numeric IP representations via URL constructor.

Also moves URL validation before page allocation in newTab() to
prevent zombie tabs on rejection (Codex P3).

5 new test cases for bypass variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3a315b33 — Garry Tan a month ago
docs: rewrite README + skills docs, auto-invoke /document-release (v0.8.4) (#207)

* docs: add 6 missing skills to proactive suggestion list

Add /codex, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade to the
root SKILL.md.tmpl proactive suggestion list so Claude suggests them at
the appropriate workflow stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add 6 new skill entries + browse handoff to docs

- docs/skills.md: add /codex, /careful, /freeze, /guard, /unfreeze,
  /gstack-upgrade to skill table with deep-dive sections. Group safety
  skills into one "Safety & Guardrails" section. Add browse handoff
  subsection to /browse deep-dive.
- BROWSER.md: add handoff/resume to command reference table + section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add power tools section + update skill lists in README

- Update prose: "Fifteen specialists and six power tools"
- Add power tools table after sprint specialists: /codex, /careful,
  /freeze, /guard, /unfreeze, /gstack-upgrade
- Update all 4 skill list locations (install Step 1, Step 2,
  troubleshooting CLAUDE.md example) to include all 21 skills

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add v0.7-v0.8.2 features to README "What's new" section

Add paragraphs for browse handoff, /codex multi-AI review, safety
guardrails (/careful, /freeze, /guard), proactive skill suggestions,
and /ship auto-invoking /document-release.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: auto-invoke /document-release after /ship PR creation

Add Step 8.5 to /ship that automatically reads document-release/SKILL.md
and executes the doc update workflow after creating the PR. This prevents
documentation drift — /ship now keeps docs current without a separate
command.

Completes P1 TODO: "Auto-invoke /document-release from /ship"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.8.4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2d97ab99 — Garry Tan a month ago
feat: browse handoff — headless-to-headed browser switching (v0.7.4) (#201)

* feat: browse handoff — headless-to-headed browser switching

Add `handoff` and `resume` commands that let users take over a visible
Chrome when the headless browser gets stuck (CAPTCHAs, auth walls, MFA).

Architecture: launch-first-close-second for safe rollback. State transfer
via extracted saveState()/restoreState() helpers (DRY with recreateContext).
Auto-handoff hint after 3 consecutive command failures.

* test: handoff unit + integration tests (15 tests)

Covers saveState/restoreState, failure tracking, edge cases (already
headed, resume without handoff), and full integration flow with cookie
and tab preservation across headless-to-headed switch.

* docs: handoff section in browse template + TODOS update

Add User Handoff section to browse/SKILL.md.tmpl with usage examples.
Update State Persistence TODO noting saveState/restoreState reusability.

* chore: bump version and changelog (v0.7.4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
c4f679d8 — Garry Tan a month ago
feat: safety hook skills + skill usage telemetry (v0.7.1) (#189)

* feat: add /careful, /freeze, /guard, /unfreeze safety hook skills

Four new on-demand skills using Claude Code's PreToolUse hooks:
- /careful: warns before destructive commands (rm -rf, DROP TABLE, force-push, etc.)
- /freeze: blocks file edits outside a specified directory
- /guard: composes both into one command
- /unfreeze: clears freeze boundary without ending session

Pure bash hook scripts with Python fallback for JSON edge cases.
Safe exceptions for build artifacts (node_modules, dist, .next, etc.).
Hook fire telemetry logs pattern name only (never command content).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add skill usage telemetry to preamble

TemplateContext system passes skill name through resolver pipeline so
each generated SKILL.md gets its own name baked into the telemetry line.
Appends to ~/.gstack/analytics/skill-usage.jsonl on every invocation.

Covers 14 preamble-using skills + 4 hook skills (inline telemetry).
JSONL format: {"skill":"ship","ts":"...","repo":"my-project"}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add analytics CLI for skill usage stats

bun run analytics reads ~/.gstack/analytics/skill-usage.jsonl and shows
top skills, per-repo breakdown, hook fire stats, and daily timeline.
Supports --period 7d/30d/all. Handles missing/empty/malformed data.

22 unit tests cover parsing, filtering, formatting, and edge cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add skills-used-this-week to /retro

Retro Step 2 now reads skill-usage.jsonl and shows which gstack skills
were used during the retro window. Follows the same pattern as the
Greptile signal and Backlog Health metrics — read file, filter by date,
aggregate, present. Skips silently if no analytics data exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add hook script and telemetry tests

32 unit tests for check-careful.sh covering all 8 destructive patterns,
safe exceptions, Python fallback, and malformed input handling.
7 unit tests for check-freeze.sh covering boundary enforcement,
trailing slash edge case, and missing state file.
Telemetry tests verify per-skill name correctness in generated output.
Adds careful/freeze/guard/unfreeze/document-release to ALL_SKILLS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.6.5 + changelog + mark TODOs shipped

Safety hook skills and skill usage telemetry shipped.
Analytics CLI and /retro integration included.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /debug auto-freezes edits to the module being debugged

Add PreToolUse hooks (Edit/Write) to debug/SKILL.md.tmpl that reference
the existing freeze/bin/check-freeze.sh. After Phase 1 investigation,
/debug locks edits to the narrowest affected directory.

Graceful degradation: if freeze script is unavailable, scope lock is
skipped. Users can run /unfreeze to remove the restriction.

Deferred 6 enhancements to TODOS.md, gated on telemetry showing the
freeze hook actually fires in real debugging sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6000af45 — Garry Tan a month ago
feat: founder discovery engine + /debug skill — v0.7.0 (#185)

* feat: add escalation protocol to preamble — all skills get DONE/BLOCKED/NEEDS_CONTEXT

Every skill now reports completion status (DONE, DONE_WITH_CONCERNS, BLOCKED,
NEEDS_CONTEXT) and has escalation rules: 3 failed attempts → STOP, security
uncertainty → STOP, scope exceeds verification → STOP.

"It is always OK to stop and say 'this is too hard for me.'"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add verification gate to /ship (Step 6.5) — no push without fresh evidence

Before pushing, re-verify tests if code changed during review fixes.
Rationalization prevention: "Should work now" → RUN IT.
"I'm confident" → Confidence is not evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add scope drift detection + verification of claims to /review

Step 1.5: Before reviewing code quality, check if the diff matches stated
intent. Flags scope creep and missing requirements (INFORMATIONAL).

Step 5 addition: Every review claim must cite evidence — "this pattern is
safe" needs a line reference, "tests cover this" needs a test name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: mandatory implementation alternatives + design doc lookup in /plan-ceo-review

Step 0C-bis: Every plan must consider 2-3 approaches (minimal viable vs ideal
architecture) before mode selection. RECOMMENDATION required.

Pre-Review System Audit now checks ~/.gstack/projects/ for /brainstorm design
docs (branch-filtered with fallback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: design doc lookup in /plan-eng-review + fix branch name sanitization

Step 0 now checks ~/.gstack/projects/ for /brainstorm design docs
(branch-filtered with fallback, reads Supersedes: for revision context).

Fix: branch names with '/' (e.g. garrytan/better-process) now get
sanitized via tr '/' '-' in test plan artifact filenames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: new /brainstorm and /debug skills

/brainstorm: Socratic design exploration before planning. Context gathering,
clarifying questions (smart-skip), related design discovery (keyword grep),
premise challenge, forced alternatives, design doc artifact with lineage
tracking (Supersedes: field). Writes to ~/.gstack/projects/$SLUG/.

/debug: Systematic root-cause debugging. Iron Law: no fixes without root
cause investigation. Pattern analysis, hypothesis testing with 3-strike
escalation, structured DEBUG REPORT output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: structural tests for new skills + escalation protocol assertions

Add brainstorm + debug to skillsWithUpdateCheck and skillsWithPreamble arrays.
Add structural tests: brainstorm (Phase 1-6, Design Doc, Supersedes, Smart-skip),
debug (Iron Law, Root Cause, Pattern Analysis, Hypothesis, DEBUG REPORT, 3-strike).
Add escalation protocol tests (DONE_WITH_CONCERNS, BLOCKED, NEEDS_CONTEXT) for
all preamble skills.

Also: 2 new TODOs (design docs → Supabase sync, /plan-design-review skill),
update CLAUDE.md project structure with new skill directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: rename /brainstorm → /office-hours across references

Update CHANGELOG, CLAUDE.md, TODOS, design-consultation, plan-ceo-review,
and gen-skill-docs to reference the new office-hours skill name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: YC Office Hours — dual-mode product diagnostic + builder brainstorm

Rewrite /office-hours with two modes:

Startup mode: six forcing questions (Demand Reality, Status Quo, Desperate
Specificity, Narrowest Wedge, Observation & Surprise, Future-Fit) that push
founders toward radical honesty about demand, users, and product decisions.
Includes smart routing by product stage, intrapreneurship adaptation, and
YC apply CTA for strong-signal founders.

Builder mode: generative brainstorming for side projects, hackathons,
learning, and open source. Enthusiastic collaborator tone, design thinking
questions, no business interrogation.

Mode is determined by an explicit question in Phase 1 — no guessing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add 14 assertions for YC Office Hours content coverage

Validates dual-mode structure (Startup/Builder), all six forcing questions,
builder brainstorming content, intrapreneurship adaptation, YC apply CTA,
and operating principles for both modes. 192 tests total, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.6.1

- README.md: added /office-hours and /debug to skills table, updated
  skill count from 13 to 15, added both to install instructions
- docs/skills.md: added /office-hours and /debug deep dive sections
- CLAUDE.md: updated office-hours description to reflect dual-mode
- CONTRIBUTING.md: updated skill count from 13 to 15
- CHANGELOG.md: added YC Office Hours and /debug entries to 0.6.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: founder discovery engine in /office-hours (v0.7.0)

Turn /office-hours into a YC founder discovery engine. Every session now
ends with three beats: signal reflection (specific callbacks to what the
user said), "One more thing." transition, and a personal plea from Garry
Tan with three tiers based on founder signal strength. Top tier uses
AskUserQuestion to ask directly and opens ycombinator.com/apply?ref=gstack.

Adds Phase 4.5 (Founder Signal Synthesis), "What I noticed about how you
think" section to both design doc templates, anti-slop GOOD/BAD examples,
and emotional targets per tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add validation assertions for founder discovery engine

8 new assertions covering: YC apply CTA with ref=gstack tracking,
"What I noticed" design doc section, golden age framing, Garry Tan
personal plea, founder signal synthesis phase, three-tier decision
rubric, anti-slop GOOD/BAD examples, "One more thing" transition beat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.7.0

VERSION: 0.6.4.1 → 0.7.0
CHANGELOG: new entry — Office Hours Gets Personal
README: updated /office-hours and /plan-design-review descriptions
docs/skills.md: updated /office-hours table + deep dive section
TODOS.md: added /yc-prep skill TODO (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove duplicate Install section, fix stale skills lists, deduplicate CHANGELOG entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bc86a665 — Garry Tan a month ago
feat: add trigger phrases to skill descriptions for better model matching (v0.6.4.1) (#169)

* feat: add trigger phrases to skill descriptions for better model matching

Anthropic's skill best practices: "the description field is not a summary —
it's when to trigger." Add explicit "Use when asked to..." phrases to 12 skill
descriptions so Claude's auto-discovery works with natural language requests
like "deploy this" or "check my diff", not just explicit /slash-commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add on-demand hooks and telemetry to TODOS.md

Captures two ideas from Anthropic's skill best practices post:
- /careful, /freeze, /guard on-demand hook skills (P3)
- Skill usage telemetry via preamble JSONL append (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.4.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: exclude internal details from CHANGELOG style guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
28becb3b — Garry Tan a month ago
feat: design review lite in /review and /ship + gstack-diff-scope (v0.6.3) (#142)

* feat: gstack-diff-scope helper + design review checklist

bin/gstack-diff-scope categorizes branch changes into SCOPE_FRONTEND,
SCOPE_BACKEND, SCOPE_PROMPTS, SCOPE_TESTS, SCOPE_DOCS, SCOPE_CONFIG.

review/design-checklist.md is a 20-item code-level checklist with
HIGH/MEDIUM/LOW confidence tags for detecting design anti-patterns
from source code.

* feat: integrate design review lite into /review and /ship

Add generateDesignReviewLite() resolver, insert {{DESIGN_REVIEW_LITE}}
partial in review Step 4.5 and ship Step 3.5. Update dashboard to
recognize design-review-lite entries. Ship pre-flight uses
gstack-diff-scope for smarter design review recommendations.

* test: E2E eval for design review lite detection

Planted CSS/HTML fixtures with 7 design anti-patterns. E2E test
verifies /review catches >= 4 of 7 (Papyrus font, 14px body text,
outline:none, !important, purple gradient, generic hero copy,
3-column feature grid).

* chore: bump version and changelog (v0.6.3.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
9d47619e — Garry Tan a month ago
feat: Completeness Principle — Boil the Lake (v0.6.1) (#140)

* feat: Completeness Principle — Boil the Lake (WIP, pre-merge)

Add Completeness Principle to all skill preambles, dual-time estimates,
compression table, anti-pattern gallery, Lake Score, and completeness
gaps review category. VERSION/CHANGELOG will be rebased after merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update stale version reference in TODOS.md (v0.5.3 → v0.6.1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CHANGELOG date + README for v0.6.1 features

- Add date to CHANGELOG 0.6.1 entry
- Add Completeness Principle to README intro
- Add SELECTIVE EXPANSION mode to CEO review section
- Add test bootstrap mention to /ship section
- Fix uninstall command missing design-consultation in project uninstall
- Add "recommends shortcuts" and "no tests" to Without gstack list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: split README into lean intro + docs/ directory (gh CLI pattern)

README: 875 → 243 lines. Keeps intro, skill table, demo, install, and
troubleshooting. All per-skill deep dives, Greptile integration guide,
and contributor mode docs moved to docs/ directory.

- docs/skills.md — full philosophy and examples for all 13 skills
- docs/greptile.md — Greptile setup and triage workflow
- docs/contributor-mode.md — how to enable and use contributor mode
- README now links to docs/ via Documentation table
- Updated skill table entries with latest features (fix-first, regression
  tests, test health, completeness gaps)
- Updated demo transcript with AUTO-FIXED, coverage audit, regression test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove "competitor" language, rewrite README in Garry's voice

Replace "browses competitors" with "knows the landscape" / "what's out
there" throughout all user-facing copy. Trim README from 243 to 167
lines — tighter, more opinionated, less listicle energy. Remove
Completeness Principle from README top (it lives in CLAUDE.md and the
skill preambles where Claude actually reads it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite README in Garry's raw voice — AGI era, L8 factory, real stories

The README now sounds like Garry, not a product page. Leads with the
live experiment, the 16k LOC/day reality, the real-life coding stories
(Austin, hospital bedside). Highlights the newest unlocks (design at
the heart, /qa parallelism, smart review routing, test bootstrap).
Closes with an open invitation — free MIT, fork it, let's all ride
the wave together.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Garry's bonafides to README intro — Palantir, Posterous, YC, 600k LOC

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add real /retro numbers — 140k lines, 362 commits across 3 projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add "in the last 60 days" timeframe to 600k LOC claim

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GitHub contribution graphs — 2026 vs 2013 side by side

Same person, different era. 2013: 772 contributions building Bookface.
2026: 1,237 contributions and accelerating. The difference is the tooling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify /retro stats are from last 7 days

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add designer/PM/eng manager roles to intro

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove Josh/L8 reference from README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move demo up, make it dramatically more impressive

Show the actual architecture diagram, auto-fixed issues, 100% coverage,
regression test generation. Punch line: "That is not a copilot. That is
a team."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove "My journey" section — intro already covers it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: prefix all skill commands with You: in demo transcript

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: collapse You/Claude lines in demo — no gap between command and response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify plan mode flow in demo — approve, exit, Claude implements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move /ship to end of demo — review → QA → ship is the real flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add /plan-design-review to demo, tighten CEO response

Shorter CEO reply, compressed eng diagram, added design audit with
AI Slop score. Seven commands now: plan → eng → build → design →
review → QA → ship.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: move design review before implementation — it's part of planning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: reorder demo — design before eng, after CEO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove URL from /plan-design-review in demo

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add [...] annotations showing what actually happens at each step

Each step now shows what the agent does under the hood: 8 expansion
proposals cherry-picked, 80-item design audit, ASCII diagrams for
every flow, 2400 lines written in 8 minutes, real browser QA, bug
found and fixed. Makes the demo feel real, not abstract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rename Contributor Mode to How to Contribute in docs table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Coinbase, Instacart, Rippling to YC bonafides

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add "one or two people in a garage" to founder story

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to top of skills.md with anchor links

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: consolidate — roll contributor-mode into CONTRIBUTING, greptile into skills

- docs/contributor-mode.md → merged into CONTRIBUTING.md (session awareness section)
- docs/greptile.md → merged into docs/skills.md (Greptile integration section)
- Reordered docs table: Skills > Architecture > Browser > Contributing > Changelog

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a2d756f9 — Garry Tan a month ago
feat: Test Bootstrap + Regression Tests + Coverage Audit (v0.6.0) (#136)

* feat: test bootstrap, regression tests, coverage audit, retro test health

- Add {{TEST_BOOTSTRAP}} resolver to gen-skill-docs.ts
- Add Phase 8e.5 regression test generation to /qa and /qa-design-review
- Add Step 3.4 test coverage audit with quality scoring to /ship
- Add test health tracking to /retro
- Add 2 E2E evals (bootstrap + coverage audit)
- Add 26 validation tests
- Update ARCHITECTURE.md placeholder table
- Add 2 P3 TODOs (CI/CD non-GitHub, auto-upgrade weak tests)

* chore: bump version and changelog (v0.6.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: make coverage audit trace actual codepaths, not just syntax patterns

Step 3.4 now instructs Claude to read full files, trace data flow through
every branch, diagram the execution, and check each branch against tests.
Phase 8e.5 regression tests now trace the bug's codepath before writing
the test, catching adjacent edge cases.

* feat: coverage audit now maps user flows, interactions, and error states

Step 3.4 now covers the full picture: code branches AND user-facing behavior.
Maps user flows (complete journey through the feature), interaction edge cases
(double-click, back button, stale state, slow connection), error states
(what does the user actually see?), and boundary states (zero results,
10k results, max-length input). Coverage diagram splits into Code Path
Coverage and User Flow Coverage sections with separate percentages.

* fix: raise test gen cap to 20, add validation tests for user flow coverage

- Raise Step 3.4 test generation cap from 10 to 20 (code + user flow combined)
- Add 3 validation tests: codepath tracing, user flow mapping, diagram sections

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
73b00b4e — Garry Tan a month ago
feat: Review Readiness Dashboard + gstack-slug helper (v0.5.1) (#130)

* feat: add bin/gstack-slug helper + migrate all inline SLUG computation

Extract the opaque SLUG sed pipeline into a shared 5-line shell script.
Replace 8 inline copies across templates with eval $(gstack-slug).
Sanitizes branch names (/ → -) to prevent subdirectory creation.

* feat: review readiness dashboard — track CEO/Eng/Design reviews per branch

Each review skill logs its result to JSONL. A shared {{REVIEW_DASHBOARD}}
placeholder displays run counts, timestamps, and a CLEARED TO SHIP verdict.
/ship pre-flight reads the dashboard and prompts when reviews are missing.

* chore: bump version and changelog (v0.5.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
4a77cc2c — Garry Tan a month ago
feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102)

* feat: add {{DESIGN_METHODOLOGY}} resolver and register design review skills

Add generateDesignMethodology() to gen-skill-docs.ts with 10-category, 80-item
design audit checklist. Register plan-design-review and qa-design-review templates
in findTemplates(). Add both skills to skill-check.ts SKILL_FILES. Add command
and snapshot flag validation tests for both skills in skill-validation.test.ts.

* feat: add /plan-design-review and /qa-design-review skills

/plan-design-review: report-only designer audit with letter grades, AI slop
scoring, structured first impression, design system extraction, DESIGN.md
inference and export offer. Never modifies code.

/qa-design-review: same audit, then iterative fix loop with style(design):
commits, CSS-safe WTF heuristic, before/after screenshots, final re-audit.

* chore: bump version and changelog (v0.5.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update README, ARCHITECTURE for design review skills (v0.5.0)

- Update skill count to 11, add /plan-design-review and /qa-design-review
  to skill table, install/uninstall commands, and demo walkthrough
- Add narrative sections: "senior designer mode" and "designer who codes mode"
  with compelling examples showing AI Slop detection and design system inference
- Add {{DESIGN_METHODOLOGY}} to ARCHITECTURE.md placeholder table
- Extend demo to show full plan→eng→review→ship→qa→design-review pipeline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: regenerate design review SKILL.md files after merge from main

Picks up BASE_BRANCH_DETECT resolver and updated contributor mode from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /design-consultation skill — design consultant that creates DESIGN.md

6-phase consultant flow: product context → competitive research (WebSearch) →
complete coherent proposal → drill-downs on demand → font+color preview page →
write DESIGN.md + update CLAUDE.md. Opinionated recommendations grounded in
product context, not menu-driven forms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add E2E tests for design skill family (7 tests + LLM quality judge)

Tests 1-4: /design-consultation (core flow, research integration, existing
DESIGN.md handling, font+color preview generation).
Tests 5-6: /plan-design-review (audit report, DESIGN.md export).
Test 7: /qa-design-review (audit + fix loop).
LLM judge validates font blacklist compliance, coherence, and AI slop avoidance.
Also adds plan-design-review + qa-design-review to ALL_SKILLS test array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: mark /design-consultation as shipped in TODOS.md

Renamed from /setup-design-md to reflect the consultant approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
a68244ab — Garry Tan a month ago
feat: /document-release skill — post-ship doc updates (v0.4.3) (#109)

* docs: update project documentation for v0.4.2

- README: skill count 9→10, added /document-release to skills table,
  install/uninstall sections, and dedicated section with example
- CHANGELOG: added /document-release bullet to v0.4.2 entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add /document-release skill with smart VERSION handling

New skill runs after /ship but before PR merge. Reads every doc file,
cross-references the diff, auto-updates factual changes, asks about
risky edits. CHANGELOG clobber protection: never uses Write tool on
CHANGELOG.md, only Edit with exact old_string matches.

Smart VERSION logic: instead of silently skipping already-bumped
versions, compares CHANGELOG entry scope against full diff and asks
if significant uncovered changes exist.

Also fixes gstack-upgrade/SKILL.md missing from skill-check.ts
SKILL_FILES array (existing inconsistency with gen-skill-docs.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: /review Step 5.6 — documentation staleness check

Review skill now cross-references code changes against doc files.
If a doc describes a feature that changed but the doc wasn't updated,
flags it as INFORMATIONAL with a pointer to /document-release.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: /document-release E2E with CHANGELOG clobber guard

E2E test creates a repo with existing CHANGELOG entries, runs
/document-release, and asserts original entries survive. Critical
guardrail against the incident where an agent replaced CHANGELOG
entries during conflict resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump to v0.4.3 — /document-release skill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md files after merge

* chore: regenerate SKILL.md files after merge

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f3ee0ee2 — Garry Tan a month ago
feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83)

* feat: browser ref staleness detection via async count() validation

resolveRef() now checks element count to detect stale refs after page
mutations (e.g. SPA navigation). RefEntry stores role+name metadata
for better diagnostics. 3 new snapshot tests for staleness detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: qa-only skill, qa fix loop, plan-to-QA artifact flow

Add /qa-only (report-only, Edit tool blocked), restructure /qa with
find-fix-verify cycle, add {{QA_METHODOLOGY}} DRY placeholder for
shared methodology. /plan-eng-review now writes test-plan artifacts
to ~/.gstack/projects/<slug>/ for QA consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: eval efficiency metrics — turns, duration, commentary across all surfaces

Add generateCommentary() for natural-language delta interpretation,
per-test turns/duration in comparison and summary output, judgePassed
unit tests, 3 new E2E tests (qa-only, qa fix loop, plan artifact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.4.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update ARCHITECTURE, BROWSER, CONTRIBUTING, README for v0.4.0

- ARCHITECTURE: add ref staleness detection section, update RefEntry type
- BROWSER: add ref staleness paragraph to snapshot system docs
- CONTRIBUTING: update eval tool descriptions with commentary feature
- README: fix missing qa-only in project-local uninstall command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add user-facing benefit descriptions to v0.4.0 changelog

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
bb46ca6b — Garry Tan a month ago
feat: smart update check with auto-upgrade, snooze backoff, config CLI (v0.3.9) (#62)

* feat: add bin/gstack-config CLI for reading/writing ~/.gstack/config.yaml

Simple get/set/list interface for persistent gstack configuration.
Used by update-check and upgrade skill for auto_upgrade and update_check settings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: smart update check with 12h cache, snooze backoff, config disable

- Reduce cache TTL from 24h to 12h for faster update detection
- Add exponential snooze backoff: 24h → 48h → 1 week (resets on new version)
- Add update_check: false config option to disable checks entirely
- Clear snooze file on just-upgraded
- 14 new tests covering snooze levels, expiry, corruption, and config paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: upgrade skill with auto-upgrade, 4-option prompt, vendored sync

- Auto-upgrade mode via config or GSTACK_AUTO_UPGRADE=1 env var
- 4-option AskUserQuestion: upgrade once, always, not now, never
- Step 4.5: sync local vendored copy after upgrading primary install
- Snooze write with escalating backoff on "Not now"
- Update preamble text in gen-skill-docs for new upgrade flow
- Regenerate all SKILL.md files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: simplify upgrade instructions, move auto-upgrade to completed

README now points to /gstack-upgrade instead of long paste commands.
Auto-upgrade TODO moved to Completed section (v0.3.8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.3.9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
41141007 — Garry Tan a month ago
feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61)

* fix: log non-ENOENT errors in ensureStateDir() instead of silently swallowing

Replace bare catch {} with ENOENT-only silence. Non-ENOENT errors (EACCES,
ENOSPC) are now logged to .gstack/browse-server.log. Includes test for
permission-denied scenario with chmod 444.

* feat: merge TODO.md + TODOS.md into unified backlog with shared format reference

Merge TODO.md (roadmap) and TODOS.md (near-term) into one file organized by
skill/component with P0-P4 priority ordering and Completed section. Add shared
review/TODOS-format.md for canonical format. Add static validation tests.

* feat: add 2-tier Greptile reply system with escalation detection

Add reply templates (Tier 1 friendly, Tier 2 firm), explicit escalation
detection algorithm, and severity re-ranking guidance to greptile-triage.md.

* feat: cross-skill TODOS awareness + Greptile template refs in all skills

/ship Step 5.5: auto-detect completed TODOs, offer reorganization.
/review Step 5.5: cross-reference PR against open TODOs.
/plan-ceo-review, /plan-eng-review: TODOS context in planning.
/retro: Backlog Health metric. /qa: bug TODO context in diff-aware mode.
All Greptile-aware skills now reference reply templates and escalation detection.

* chore: bump version and changelog (v0.3.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CONTRIBUTING.md for v0.3.8 changes

Clarify test tier cost table (Tier 3 standalone vs combined), add TODOS.md
to "Things to know", mention Greptile triage in ship workflow description.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
7d266661 — Garry Tan a month ago
Merge pull request #55 from garrytan/v0.3.6-qa-upgrades

feat: E2E observability + eval infrastructure + all skills templated
a67dae5f — Garry Tan a month ago
fix: update check preamble exits 1 when up to date — convert all skills to .tmpl

The `[ -n "$_UPD" ] && echo "$_UPD"` line in 5 skills was missing `|| true`,
causing exit code 1 when the update check finds no update (empty $_UPD).

Fix: convert ship/, review/, plan-ceo-review/, plan-eng-review/, retro/ to
.tmpl templates using {{UPDATE_CHECK}} placeholder (same as browse/qa/etc).
All 9 skills now generated from templates — preamble changes propagate everywhere.

Also: regenerates qa/SKILL.md which had drifted from its template, adds 12 tests
validating the update check preamble exits 0 in all skills, removes completed TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ed802d0c — Garry Tan a month ago
feat: eval CLI tools + docs cleanup

Add eval:list, eval:compare, eval:summary CLI scripts for exploring
eval history from ~/.gstack-dev/evals/. eval:compare reuses the shared
comparison functions from eval-store.ts.

- eval:list: sorted table with branch/tier/cost filters
- eval:compare: thin wrapper around compareEvalResults + formatComparison
- eval:summary: aggregate stats, flaky test detection, branch rankings
- Remove unused @anthropic-ai/claude-agent-sdk from devDependencies
- Update CLAUDE.md: streaming docs, eval CLI commands, remove Agent SDK refs
- Add GH Actions eval upload (P2) and web dashboard (P3) to TODOS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3d750d89 — Garry Tan a month ago
Merge remote-tracking branch 'origin/main' into v0.3.6-qa-upgrades

# Conflicts:
#	test/skill-e2e.test.ts
1717ed28 — Garry Tan a month ago
fix: browse binary discovery broken for agents (v0.3.5) (#44)

* fix: replace find-browse with direct path in SKILL.md setup blocks

Agents were skipping the find-browse binary and guessing bin/browse
(wrong path). Now the setup block explicitly checks browse/dist/browse
with workspace-local priority, global fallback.

Also adds || true to update check to prevent misleading exit code 1.

Adds {{UPDATE_CHECK}} and {{BROWSE_SETUP}} template placeholders to
gen-skill-docs.ts so all skills share a single source of truth.

* refactor: convert qa/ and setup-browser-cookies/ to .tmpl templates

Replaces hardcoded update check and find-browse blocks with
{{UPDATE_CHECK}} and {{BROWSE_SETUP}} placeholders. Both skills
are now generated from templates via gen-skill-docs.

* test: add e2e and LLM eval tests for SKILL.md setup block

- 3 Agent SDK e2e tests: happy path, NEEDS_SETUP, non-git-repo
- LLM eval: setup block clarity + actionability >= 4
- New error pattern: 'no such file or directory.*browse'

These tests catch the exact failure mode where agents can't discover
the browse binary via SKILL.md instructions.

* chore: bump version and changelog (v0.3.5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Next