# TODOS

## Browse

### Bundle server.ts into compiled binary

**What:** Eliminate the `resolveServerScript()` fallback chain entirely — bundle server.ts into the compiled browse binary.
**Why:** The current fallback chain (check adjacent to cli.ts, check global install) is fragile and caused bugs in v0.3.2. A single compiled binary is simpler and more reliable.
**Context:** Bun's `--compile` flag can bundle multiple entry points. The server is currently resolved at runtime via file path lookup. Bundling it removes the resolution step entirely.
**Effort:** M
**Priority:** P2
**Depends on:** None

### Sessions (isolated browser instances)

**What:** Isolated browser instances with separate cookies/storage/history, addressable by name.
**Why:** Enables parallel testing of different user roles, A/B test verification, and clean auth state management.
**Context:** Requires Playwright browser context isolation. Each session gets its own context with independent cookies/localStorage. Prerequisite for video recording (clean context lifecycle) and auth vault.
**Effort:** L
**Priority:** P3

### Video recording

**What:** Record browser interactions as video (start/stop controls).
**Why:** Video evidence in QA reports and PR bodies. Currently deferred because `recreateContext()` destroys page state.
**Context:** Needs sessions for clean context lifecycle. Playwright supports video recording per context. Also needs WebM → GIF conversion for PR embedding.
**Effort:** M
**Priority:** P3
**Depends on:** Sessions

### v20 encryption format support

**What:** AES-256-GCM support for future Chromium cookie DB versions (currently v10).
**Why:** Future Chromium versions may change encryption format. Proactive support prevents breakage.
**Effort:** S
**Priority:** P3

### State persistence

**What:** Save/load cookies + localStorage to JSON files for reproducible test sessions.
**Why:** Enables "resume where I left off" for QA sessions and repeatable auth states.
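A minimal sketch of that save/load round trip. The state shape loosely follows Playwright's `storageState()` output (cookies plus per-origin localStorage); the `saveState`/`loadState` helpers and all field values below are hypothetical, not existing browse code.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Shape loosely mirrors Playwright's storageState(): cookies plus
// per-origin localStorage entries. Field names here are assumptions.
interface BrowserState {
  cookies: { name: string; value: string; domain: string; path: string }[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Hypothetical helpers: persist a session's state, restore it later.
function saveState(path: string, state: BrowserState): void {
  writeFileSync(path, JSON.stringify(state, null, 2));
}

function loadState(path: string): BrowserState {
  return JSON.parse(readFileSync(path, "utf8")) as BrowserState;
}

// Round-trip example against a temp file.
const file = join(mkdtempSync(join(tmpdir(), "gstack-")), "state.json");
const state: BrowserState = {
  cookies: [{ name: "session", value: "abc123", domain: "example.com", path: "/" }],
  origins: [{ origin: "https://example.com", localStorage: [{ name: "theme", value: "dark" }] }],
};
saveState(file, state);
const restored = loadState(file);
```

Because the on-disk format is plain JSON, a saved state can also be hand-edited or checked into test fixtures.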
**Effort:** M
**Priority:** P3
**Depends on:** Sessions

### Auth vault

**What:** Encrypted credential storage, referenced by name. LLM never sees passwords.
**Why:** Security — currently auth credentials flow through the LLM context. Vault keeps secrets out of the AI's view.
**Effort:** L
**Priority:** P3
**Depends on:** Sessions, state persistence

### Iframe support

**What:** `frame` and `frame main` commands for cross-frame interaction.
**Why:** Many web apps use iframes (embeds, payment forms, ads). Currently invisible to browse.
**Effort:** M
**Priority:** P4

### Semantic locators

**What:** `find role/label/text/placeholder/testid` with attached actions.
**Why:** More resilient element selection than CSS selectors or ref numbers.
**Effort:** M
**Priority:** P4

### Device emulation presets

**What:** `set device "iPhone 16 Pro"` for mobile/tablet testing.
**Why:** Responsive layout testing without manual viewport resizing.
**Effort:** S
**Priority:** P4

### Network mocking/routing

**What:** Intercept, block, and mock network requests.
**Why:** Test error states, loading states, and offline behavior.
**Effort:** M
**Priority:** P4

### Download handling

**What:** Click-to-download with path control.
**Why:** Test file download flows end-to-end.
**Effort:** S
**Priority:** P4

### Content safety

**What:** `--max-output` truncation, `--allowed-domains` filtering.
**Why:** Prevent context window overflow and restrict navigation to safe domains.
**Effort:** S
**Priority:** P4

### Streaming (WebSocket live preview)

**What:** WebSocket-based live preview for pair browsing sessions.
**Why:** Enables real-time collaboration — human watches AI browse.
**Effort:** L
**Priority:** P4

### CDP mode

**What:** Connect to already-running Chrome/Electron apps via Chrome DevTools Protocol.
**Why:** Test production apps, Electron apps, and existing browser sessions without launching new instances.
**Effort:** M
**Priority:** P4

### Linux/Windows cookie decryption

**What:** GNOME Keyring / kwallet / DPAPI support for non-macOS cookie import.
**Why:** Cross-platform cookie import. Currently macOS-only (Keychain).
**Effort:** L
**Priority:** P4

## Ship

### Ship log — persistent record of /ship runs

**What:** Append structured JSON entry to `.gstack/ship-log.json` at the end of every /ship run (version, date, branch, PR URL, review findings, Greptile stats, todos completed, test results).
**Why:** /retro has no structured data about shipping velocity. Ship log enables: PRs-per-week trending, review finding rates, Greptile signal over time, test suite growth.
**Context:** /retro already reads greptile-history.md — same pattern. Eval persistence (eval-store.ts) shows the JSON append pattern exists in the codebase. ~15 lines in ship template.
**Effort:** S
**Priority:** P2
**Depends on:** None

### Post-deploy verification (ship + browse)

**What:** After push, browse staging/preview URL, screenshot key pages, check console for JS errors, compare staging vs prod via snapshot diff. Include verification screenshots in PR body. STOP if critical errors found.
**Why:** Catch deployment-time regressions (JS errors, broken layouts) before merge.
**Context:** Requires S3 upload infrastructure for PR screenshots. Pairs with visual PR annotations.
**Effort:** L
**Priority:** P2
**Depends on:** /setup-gstack-upload, visual PR annotations

### Visual verification with screenshots in PR body

**What:** /ship Step 7.5: screenshot key pages after push, embed in PR body.
**Why:** Visual evidence in PRs. Reviewers see what changed without deploying locally.
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
**Effort:** M
**Priority:** P2
**Depends on:** /setup-gstack-upload

## Review

### Inline PR annotations

**What:** /ship and /review post inline review comments at specific file:line locations using `gh api` to create pull request review comments.
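A sketch of the payload such a call might send. The field names follow GitHub's documented pull-request review-comment REST endpoint; the `buildReviewComment` helper and all sample values are hypothetical.

```typescript
// Field names follow GitHub's "create a review comment for a pull request"
// REST endpoint: POST /repos/{owner}/{repo}/pulls/{pull_number}/comments.
interface ReviewComment {
  body: string;      // markdown comment text
  commit_id: string; // SHA of the commit the comment applies to
  path: string;      // file path relative to the repo root
  line: number;      // diff line to anchor the comment on
  side: "LEFT" | "RIGHT"; // RIGHT = the new version of the file
}

// Hypothetical helper: assemble the JSON body that `gh api` would POST,
// e.g. via: gh api repos/$REPO/pulls/$PR/comments --input payload.json
function buildReviewComment(path: string, line: number, sha: string, body: string): ReviewComment {
  return { body, commit_id: sha, path, line, side: "RIGHT" };
}

const comment = buildReviewComment("src/server.ts", 42, "deadbeef", "Possible null deref here.");
console.log(JSON.stringify(comment, null, 2));
```

Building the payload as data (rather than inlining flags) makes it easy to batch one comment per finding.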
**Why:** Line-level annotations are more actionable than top-level comments. The PR thread becomes a line-by-line conversation between Greptile, Claude, and human reviewers.
**Context:** GitHub supports inline review comments via `gh api repos/$REPO/pulls/$PR/reviews`. Pairs naturally with Phase 3.6 visual annotations.
**Effort:** S
**Priority:** P2
**Depends on:** None

### Greptile training feedback export

**What:** Aggregate greptile-history.md into a machine-readable JSON summary of false positive patterns, exportable to the Greptile team for model improvement.
**Why:** Closes the feedback loop — Greptile can use FP data to stop making the same mistakes on your codebase.
**Context:** Was a P3 Future Idea. Upgraded to P2 now that the greptile-history.md data infrastructure exists. The signal data is already being collected; this just makes it exportable. ~40 lines.
**Effort:** S
**Priority:** P2
**Depends on:** Enough FP data accumulated (10+ entries)

### Visual review with annotated screenshots

**What:** /review Step 4.5: browse the PR's preview deploy, annotated screenshots of changed pages, compare against production, check responsive layouts, verify accessibility tree.
**Why:** Visual diff catches layout regressions that code review misses.
**Context:** Part of Phase 3.6. Needs S3 upload for image hosting.
**Effort:** M
**Priority:** P2
**Depends on:** /setup-gstack-upload

## QA

### QA trend tracking

**What:** Compare baseline.json over time, detect regressions across QA runs.
**Why:** Spot quality trends — is the app getting better or worse?
**Context:** QA already writes structured reports. This adds cross-run comparison.
**Effort:** S
**Priority:** P2

### CI/CD QA integration

**What:** `/qa` as a GitHub Action step, fail the PR if the health score drops.
**Why:** Automated quality gate in CI. Catch regressions before merge.
**Effort:** M
**Priority:** P2

### Smart default QA tier

**What:** After a few runs, check index.md for the user's usual tier pick, skip the AskUserQuestion.
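One way the "usual tier pick" heuristic could look. This is a sketch under assumptions: the `pickDefaultTier` helper, the three-run minimum, and the strict-majority rule are all hypothetical, not the actual index.md logic.

```typescript
// Hypothetical heuristic: if the user picked the same QA tier in a strict
// majority of recent runs, use it as the default and skip the prompt.
function pickDefaultTier(history: string[], minRuns = 3): string | null {
  if (history.length < minRuns) return null; // not enough data yet

  const counts = new Map<string, number>();
  for (const tier of history) counts.set(tier, (counts.get(tier) ?? 0) + 1);

  // Find the most frequent tier, then require a strict majority.
  let best: string | null = null;
  let bestCount = 0;
  for (const [tier, count] of counts) {
    if (count > bestCount) { best = tier; bestCount = count; }
  }
  return bestCount * 2 > history.length ? best : null;
}

pickDefaultTier(["quick", "quick", "full", "quick"]); // clear majority
pickDefaultTier(["quick"]);                           // too few runs
```

Returning `null` when there is no clear winner keeps the AskUserQuestion as the fallback rather than guessing.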
**Why:** Reduces friction for repeat users.
**Effort:** S
**Priority:** P2

### Accessibility audit mode

**What:** `--a11y` flag for focused accessibility testing.
**Why:** Dedicated accessibility testing beyond the general QA checklist.
**Effort:** S
**Priority:** P3

## Retro

### Deployment health tracking (retro + browse)

**What:** Screenshot production state, check perf metrics (page load times), count console errors across key pages, track trends over the retro window.
**Why:** Retro should include production health alongside code metrics.
**Context:** Requires browse integration. Screenshots + metrics fed into retro output.
**Effort:** L
**Priority:** P3
**Depends on:** Browse sessions

## Infrastructure

### /setup-gstack-upload skill (S3 bucket)

**What:** Configure an S3 bucket for image hosting. One-time setup for visual PR annotations.
**Why:** Prerequisite for visual PR annotations in /ship and /review.
**Effort:** M
**Priority:** P2

### gstack-upload helper

**What:** `browse/bin/gstack-upload` — upload a file to S3, return the public URL.
**Why:** Shared utility for all skills that need to embed images in PRs.
**Effort:** S
**Priority:** P2
**Depends on:** /setup-gstack-upload

### WebM to GIF conversion

**What:** ffmpeg-based WebM → GIF conversion for video evidence in PRs.
**Why:** GitHub PR bodies render GIFs but not WebM. Needed for video recording evidence.
**Effort:** S
**Priority:** P3
**Depends on:** Video recording

### Deploy-verify skill

**What:** Lightweight post-deploy smoke test: hit key URLs, verify 200s, screenshot critical pages, console error check, compare against baseline snapshots. Pass/fail with evidence.
**Why:** Fast post-deploy confidence check, separate from full QA.
**Effort:** M
**Priority:** P2

### GitHub Actions eval upload

**What:** Run the eval suite in CI, upload the result JSON as an artifact, post a summary comment on the PR.
**Why:** CI integration catches quality regressions before merge and provides persistent eval records per PR.
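A sketch of what the delta-comment step might compute. The `EvalResult` field names and the `formatDelta` helper are assumptions for illustration; the real result schema lives in eval-store.ts.

```typescript
// Hypothetical eval result shape; the real schema lives in eval-store.ts.
interface EvalResult {
  passRate: number; // fraction of eval cases passed, 0..1
  costUsd: number;  // API spend for the run
}

// Hypothetical helper: render a one-line PR comment comparing the CI run
// against a baseline, roughly what an `eval:compare` step could post.
function formatDelta(baseline: EvalResult, current: EvalResult): string {
  const passDelta = (current.passRate - baseline.passRate) * 100;
  const sign = passDelta >= 0 ? "+" : "";
  return `Pass rate: ${(current.passRate * 100).toFixed(1)}% (${sign}${passDelta.toFixed(1)} pts), cost $${current.costUsd.toFixed(2)}`;
}

const summary = formatDelta({ passRate: 0.90, costUsd: 3.8 }, { passRate: 0.95, costUsd: 4.1 });
```

Keeping the comparison as a pure function makes it trivial to unit-test apart from the CI wiring.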
**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. The eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/` — CI would upload these as GitHub Actions artifacts and use `eval:compare` to post a delta comment.
**Effort:** M
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)

### E2E model pinning

**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
**Why:** Reduce E2E test cost and flakiness.
**Effort:** XS
**Priority:** P2

### Eval web dashboard

**What:** `bun run eval:dashboard` serves a local HTML page with charts: cost trending, detection rate, pass/fail history.
**Why:** Visual charts are better for spotting trends than CLI tools.
**Context:** Reads `~/.gstack-dev/evals/*.json`. ~200 lines of HTML + chart.js via a Bun HTTP server.
**Effort:** M
**Priority:** P3
**Depends on:** Eval persistence (shipped in v0.3.6)

## Completed

### Phase 1: Foundations (v0.2.0)

- Rename to gstack
- Restructure to monorepo layout
- Setup script for skill symlinks
- Snapshot command with ref-based element selection
- Snapshot tests

**Completed:** v0.2.0

### Phase 2: Enhanced Browser (v0.2.0)

- Annotated screenshots, snapshot diffing, dialog handling, file upload
- Cursor-interactive elements, element state checks
- CircularBuffer, async buffer flush, health check
- Playwright error wrapping, useragent fix
- 148 integration tests

**Completed:** v0.2.0

### Phase 3: QA Testing Agent (v0.3.0)

- /qa SKILL.md with 6-phase workflow, 3 modes (full/quick/regression)
- Issue taxonomy, severity classification, exploration checklist
- Report template, health score rubric, framework detection
- wait/console/cookie-import commands, find-browse binary

**Completed:** v0.3.0

### Phase 3.5: Browser Cookie Import (v0.3.x)

- cookie-import-browser command (Chromium cookie DB decryption)
- Cookie picker web UI, /setup-browser-cookies skill
- 18 unit tests, browser registry (Comet, Chrome, Arc, Brave, Edge)

**Completed:** v0.3.1

### E2E test cost tracking

- Track cumulative API spend, warn if over threshold

**Completed:** v0.3.6

### Auto-upgrade mode + smart update check

- Config CLI (`bin/gstack-config`), auto-upgrade via `~/.gstack/config.yaml`, 12h cache TTL, exponential snooze backoff (24h→48h→1wk), "never ask again" option, vendored copy sync on upgrade

**Completed:** v0.3.8