~cytrogen/gstack (9c5f479745acc90533a7ff75a00771b9056c43ef): docs/designs/SELF_LEARNING_V0.md

#Design: GStack Self-Learning Infrastructure

Generated by /office-hours + /plan-ceo-review + /plan-eng-review on 2026-03-28
Updated: 2026-04-01 (post-Session Intelligence, reviewed by Codex)
Branch: garrytan/ce-features
Repo: gstack
Status: ACTIVE
Mode: Open Source / Community

#Problem Statement

GStack runs 30+ skills across sessions but learns nothing between them. A /review session catches an N+1 query pattern, and the next /review on the same codebase starts from scratch. A /ship run discovers the test command, and every future /ship re-discovers it. A /investigate finds a tricky race condition, and no future session knows about it.

Every AI coding tool has this problem. Cursor has per-user memory. Claude Code has CLAUDE.md. Windsurf has persistent context. But none of them compound. None of them structure what they learn. None of them share knowledge across skills.

#What We're Building

Per-project institutional knowledge that compounds across sessions and skills. Structured, typed, confidence-scored learnings that every gstack skill can read and write. The goal: after 20 sessions on the same codebase, gstack knows every architectural decision, every past bug pattern, and every time it was wrong.

#North Star

/autoship (Release 5). A full engineering team in one command. Describe a feature, approve the plan, everything else is automatic. /autoship can't work without learnings (R1), review quality (R2), session persistence (R3), and adaptive ceremony (R4). Releases 1-4 are the infrastructure that makes /autoship actually work.

#Audience

YC founders building with AI. The people who run gstack on real codebases 20+ times a week and notice when it asks the same question twice.

#Differentiation

Tool	Memory model	Scope	Structure
Cursor	Per-user chat memory	Per-session	Unstructured
CLAUDE.md	Static file	Per-project	Manual
Windsurf	Persistent context	Per-session	Unstructured
GStack	Per-project JSONL	Cross-session, cross-skill	Typed, scored, decaying

#State Systems

gstack has four distinct persistence layers. They share storage patterns (JSONL in ~/.gstack/projects/$SLUG/) but serve different purposes:

System	File	What it stores	Written by	Read by
Learnings	`learnings.jsonl`	Institutional knowledge (pitfalls, patterns, preferences)	All skills	All skills (preamble)
Timeline	`timeline.jsonl`	Event history (skill start/complete, branch, outcome)	Preamble (automatic)	/retro, preamble context recovery
Checkpoints	`checkpoints/*.md`	Working state snapshots (decisions, remaining work, files)	/checkpoint, /ship, /investigate	Preamble context recovery, /checkpoint resume
Health	`health-history.jsonl`	Code quality scores over time (per-tool, composite)	/health	/retro, /ship (gate), /health (trends)

These do not overlap. Learnings = what you know. Timeline = what happened. Checkpoints = where you are. Health = how good the code is. Each answers a different question.

#Release Roadmap

#Release 1: "GStack Learns" (v0.13-0.14) — SHIPPED

Headline: Every session makes the next one smarter.

What shipped:

Learnings persistence at ~/.gstack/projects/$SLUG/learnings.jsonl
/learn skill for manual review, search, prune, export
Confidence calibration on all review findings (1-10 scores with display rules)
Confidence decay for observed/inferred learnings (1 point per 30 days)
Cross-project learnings discovery (opt-in, AskUserQuestion consent)
"Learning applied" callouts when reviews match past learnings
Integration into /review, /ship, /plan-*, /office-hours, /investigate, /retro

Schema:

{
  "ts": "2026-03-28T12:00:00Z",
  "skill": "review",
  "type": "pitfall",
  "key": "n-plus-one-activerecord",
  "insight": "Always check includes() for has_many in list endpoints",
  "confidence": 8,
  "source": "observed",
  "branch": "feature-x",
  "commit": "abc1234",
  "files": ["app/models/user.rb"]
}
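The decay rule for observed/inferred learnings can be sketched against the `ts`, `confidence`, and `source` fields of this schema. This is a minimal illustration, not gstack's implementation: it assumes decay is applied in whole points per full 30-day period, that confidence floors at 0, and that any `source` other than `observed` or `inferred` does not decay. The function name `effective_confidence` is hypothetical.

```python
from datetime import datetime, timezone

DECAYING_SOURCES = {"observed", "inferred"}  # sources named in the release notes
DECAY_PERIOD_DAYS = 30                       # one point per 30 days
MIN_CONFIDENCE = 0                           # assumption: the floor is not specified


def effective_confidence(entry: dict, now: datetime) -> int:
    """Return the entry's confidence after age-based decay (hypothetical helper)."""
    if entry.get("source") not in DECAYING_SOURCES:
        return entry["confidence"]  # assumption: other sources keep their score
    # Accept the trailing "Z" used in the schema by mapping it to a UTC offset.
    written = datetime.fromisoformat(entry["ts"].replace("Z", "+00:00"))
    age_days = max((now - written).days, 0)  # ignore timestamps in the future
    points_lost = age_days // DECAY_PERIOD_DAYS
    return max(entry["confidence"] - points_lost, MIN_CONFIDENCE)


if __name__ == "__main__":
    entry = {"ts": "2026-03-28T12:00:00Z", "confidence": 8, "source": "observed"}
    now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
    print(effective_confidence(entry, now))
```

Computing decay at read time keeps the stored entry unchanged, which fits an append-only log.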

Architecture: append-only JSONL. Duplicates are resolved at read time: the latest entry wins per key+type. Entries are never mutated after they are written, which avoids read-modify-write races between concurrent sessions.
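Read-time resolution of duplicates can be sketched as follows. The sketch assumes "latest" means the greatest `ts` (with ties going to the later line in the file) and that malformed lines are skipped rather than treated as errors; neither detail is stated above, and `load_learnings` is a hypothetical name.

```python
import json


def load_learnings(lines):
    """Collapse append-only JSONL lines to one entry per (key, type)."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: skip malformed lines, e.g. a torn final write
        if not isinstance(entry, dict) or "key" not in entry or "type" not in entry:
            continue
        ident = (entry["key"], entry["type"])
        current = latest.get(ident)
        # ISO 8601 timestamps in one format compare correctly as strings.
        # ">=" lets a later line win when two timestamps are equal.
        if current is None or entry.get("ts", "") >= current.get("ts", ""):
            latest[ident] = entry
    return list(latest.values())


if __name__ == "__main__":
    sample = [
        '{"ts": "2026-03-28T12:00:00Z", "type": "pitfall", "key": "k", "insight": "old"}',
        '{"ts": "2026-04-01T09:00:00Z", "type": "pitfall", "key": "k", "insight": "new"}',
    ]
    for entry in load_learnings(sample):
        print(entry["key"], entry["insight"])
```

Because writers only ever append, superseding a learning means writing a new line with the same key and type; readers pay the cost of the scan.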

#Release 2: "Review Army" (v0.14.3-0.14.4) — SHIPPED

Headline: 10 specialist reviewers on every PR.

What shipped:

7 parallel specialist subagents: always-on (testing, maintainability) + conditional (security, performance, data-migration, API contract, design) + red team (large diffs / critical findings)
JSON-structured findings with confidence scores + fingerprint dedup across agents
PR quality score (0-10) logged per review + /retro trending
Learning-informed specialist prompts: past pitfalls injected per domain
Multi-specialist consensus highlighting: confirmed findings get boosted
Enhanced Delivery Integrity via PLAN_COMPLETION_AUDIT
Checklist refactored: CRITICAL categories stay in main pass, specialist categories extracted to focused checklists in review/specialists/
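Fingerprint dedup and consensus boosting across specialists could look roughly like this. The finding fields (`file`, `line`, `category`, `specialist`), the fingerprint recipe, and the boost of one point per additional confirming specialist are all assumptions made for illustration; the source does not specify them.

```python
import hashlib


def fingerprint(finding: dict) -> str:
    """Hypothetical fingerprint: same file, line, and category means same finding."""
    raw = f"{finding['file']}:{finding['line']}:{finding['category']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:12]


def merge_findings(findings, boost=1, max_confidence=10):
    """Deduplicate findings across specialists and boost confirmed ones."""
    merged = {}
    for finding in findings:
        fp = fingerprint(finding)
        if fp not in merged:
            merged[fp] = {**finding, "fingerprint": fp,
                          "specialists": [finding["specialist"]]}
            continue
        kept = merged[fp]
        if finding["specialist"] not in kept["specialists"]:
            kept["specialists"].append(finding["specialist"])
        # Keep the highest confidence reported for this fingerprint.
        kept["confidence"] = max(kept["confidence"], finding["confidence"])
    for kept in merged.values():
        extra = len(kept["specialists"]) - 1  # specialists beyond the first
        kept["confidence"] = min(kept["confidence"] + boost * extra, max_confidence)
        kept["confirmed"] = extra > 0
    return list(merged.values())
```

A finding reported by both the security and performance specialists collapses to one entry marked `confirmed`, with its confidence capped at 10.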

#Release 2.5: "Review Army Expansions" — NOT YET SHIPPED

Headline: Ship after R2 proves stable. Check in on how the core loop is performing.

Pre-check: review R2 quality metrics (PR quality scores, specialist hit rates, false positive rates, E2E test stability). If the core loop has issues, fix those first.

What ships:

E1: Adaptive specialist gating. Auto-skip specialists with a 0-finding track record. Store per-project hit rates via gstack-learnings-log. The user can force a specialist with --security etc.
E3: Test stub generation. Each specialist outputs a TEST_STUB alongside its findings. The framework is detected from the project (Jest/Vitest/RSpec/pytest/Go test). Flows into Fix-First: AUTO-FIX applies the fix and creates the test file.
E5: Cross-review finding dedup. Read prior review entries via gstack-review-read. Suppress findings that match a finding the user previously skipped.
E7: Specialist performance tracking. Log per-specialist metrics via gstack-review-log. Timeline integration: specialist runs appear in timeline.jsonl for /retro trending.
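The gating rule in E1 might be sketched like this. The shape of the hit-rate record (`runs`, `findings`), the minimum number of runs before a specialist counts as having a track record, and the choice to exempt the always-on specialists from gating are assumptions, not decisions recorded in this doc.

```python
ALWAYS_ON = {"testing", "maintainability"}  # always-on specialists from Release 2


def select_specialists(candidates, hit_rates, forced=(), min_runs=5):
    """Pick which specialists to run for this review (hypothetical helper).

    hit_rates maps specialist name -> {"runs": int, "findings": int}.
    forced holds specialists requested explicitly, e.g. via --security.
    """
    selected = []
    for name in candidates:
        stats = hit_rates.get(name, {"runs": 0, "findings": 0})
        # Only skip once there is enough history to call it a track record.
        proven_silent = stats["runs"] >= min_runs and stats["findings"] == 0
        if name in ALWAYS_ON or name in forced or not proven_silent:
            selected.append(name)
    return selected
```

Requiring a minimum number of runs avoids skipping a specialist that simply has not had a chance to find anything yet.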

#Release 3: "Session Intelligence" (v0.15.0) — SHIPPED

Headline: Your AI sessions remember what happened.

What shipped:

Session timeline: every skill auto-logs start/complete events to ~/.gstack/projects/$SLUG/timeline.jsonl. Local-only, never sent anywhere, always on regardless of telemetry setting.
Context recovery: after compaction or session start, preamble lists recent CEO plans, checkpoints, and reviews. Agent reads the most recent to recover context.
Cross-session injection: preamble prints LAST_SESSION and LATEST_CHECKPOINT for the current branch. You see where you left off before typing anything.
Predictive skill suggestion: if your last 3 sessions follow a pattern (review, ship, review), gstack suggests what you probably want next.
"Welcome back" synthesized context message on session start.
/checkpoint skill: save/resume/list working state snapshots. Cross-branch listing for Conductor workspace handoff between agents.
/health skill: code quality scorekeeper wrapping project tools (tsc, biome, knip, shellcheck, tests). Composite 0-10 score, trend tracking, improvement suggestions when scores drop.
Timeline binaries: bin/gstack-timeline-log and bin/gstack-timeline-read.
Routing rules: /checkpoint and /health added to preamble skill routing.
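Predictive skill suggestion over the timeline can be illustrated with a simple successor-frequency heuristic: look at the last three completed skills on the current branch and suggest whichever skill most often followed the most recent one. This is a swapped-in heuristic for illustration, not necessarily what gstack does, and the event fields `skill`, `event`, and `branch` are assumed names.

```python
from collections import Counter


def completed_skills(events, branch):
    """Skills that completed on the given branch, oldest first."""
    return [e["skill"] for e in events
            if e.get("event") == "completed" and e.get("branch") == branch]


def suggest_next(skills, window=3):
    """Suggest the skill that most often followed the latest one, or None."""
    recent = skills[-window:]
    if len(recent) < window:
        return None  # not enough history to call it a pattern
    last = recent[-1]
    followers = Counter(nxt for prev, nxt in zip(recent, recent[1:]) if prev == last)
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

For the pattern mentioned above (review, ship, review), the only observed successor of `review` is `ship`, so `ship` is suggested.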

Design doc: docs/designs/SESSION_INTELLIGENCE.md

#Release 4: "Adaptive Ceremony" — NOT YET SHIPPED

Headline: GStack respects your time without compromising your safety.

Ceremony and trust are separate concerns. Ceremony = the set of review/test/QA steps a PR goes through. Trust = a policy engine that determines which ceremony level applies. They interact but don't merge.

What ships:

Ceremony levels:

FULL: all specialists, adversarial, Codex structured review, coverage audit, plan completion. For large diffs, new features, migrations, auth changes.
STANDARD: adversarial + Codex, coverage audit, plan completion. For medium diffs, typical feature work.
FAST: adversarial only. For small, well-tested changes on trusted projects.

Trust policy engine:

Scope-aware trust. Trust is earned per change class, not globally. Clean history on docs-only PRs does not buy trust on migration PRs.
Change class detection: docs, tests, config, frontend, backend, migrations, auth, infra. Each class has its own trust threshold.
Trust signals: consecutive clean reviews (per class), /health score stability, regression frequency, test coverage trends.
Trust never fast-tracks: migrations, auth/permission changes, new API endpoints, infrastructure changes. These always get FULL ceremony regardless of trust level.
Gradual degradation, not binary reset. A single regression doesn't reset all trust. It degrades trust for that change class by one level.
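Per-class trust with gradual degradation could be modeled along these lines. The integer trust levels (0 to 2), the promotion rule (five consecutive clean reviews), and the class name `TrustLedger` are hypothetical; only the change classes and the "one level per regression, per class" behavior come from the text above.

```python
from dataclasses import dataclass, field

CHANGE_CLASSES = ("docs", "tests", "config", "frontend",
                  "backend", "migrations", "auth", "infra")
MAX_TRUST = 2      # assumption: three trust levels, 0..2
PROMOTE_AFTER = 5  # assumption: clean reviews needed to gain one level


@dataclass
class TrustLedger:
    levels: dict = field(default_factory=lambda: {c: 0 for c in CHANGE_CLASSES})
    clean_streak: dict = field(default_factory=lambda: {c: 0 for c in CHANGE_CLASSES})

    def record_clean_review(self, change_class: str) -> None:
        """Trust is earned per change class, never globally."""
        self.clean_streak[change_class] += 1
        if (self.clean_streak[change_class] >= PROMOTE_AFTER
                and self.levels[change_class] < MAX_TRUST):
            self.levels[change_class] += 1
            self.clean_streak[change_class] = 0

    def record_regression(self, change_class: str) -> None:
        """Degrade one level for this class only; other classes are untouched."""
        self.levels[change_class] = max(self.levels[change_class] - 1, 0)
        self.clean_streak[change_class] = 0
```

Keeping a separate counter per class is what prevents a clean docs history from buying trust on migrations.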

Scope assessment:

TINY/SMALL/MEDIUM/LARGE classification in /review, /ship, /autoplan based on diff size, files touched, and change class.
Ceremony level = f(scope, trust, change class).
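One possible shape for f(scope, trust, change class) is sketched below. The specific mapping is an assumption: it takes the lowest trust among the touched change classes, requires the top trust level for FAST, and treats new API endpoints as a separate flag because they are not one of the listed change classes.

```python
NEVER_FAST_TRACK = {"migrations", "auth", "infra"}  # always FULL, per the policy
FAST_TRUST = 2  # assumption: trust level required before FAST is allowed


def ceremony_level(scope, change_classes, trust_by_class, adds_api_endpoint=False):
    """Return "FULL", "STANDARD", or "FAST" for a change (hypothetical mapping)."""
    classes = set(change_classes)
    if adds_api_endpoint or classes & NEVER_FAST_TRACK:
        return "FULL"
    if scope == "LARGE":
        return "FULL"
    # A change is only as trusted as its least-trusted class.
    trust = min((trust_by_class.get(c, 0) for c in classes), default=0)
    if scope in ("TINY", "SMALL") and trust >= FAST_TRUST:
        return "FAST"
    return "STANDARD"
```

Checking the never-fast-track classes first means no trust score can lower ceremony for those changes.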

TODO lifecycle:

/triage for interactive approval of incoming TODOs
/resolve for batch resolution via parallel agents

#Release 5: "/autoship — One Command, Full Feature" — NOT YET SHIPPED

Headline: Describe a feature. Approve the plan. Everything else is automatic.

/autoship is a resumable state machine, not a linear pipeline. Review and QA can send work back to build/fix. Compaction can interrupt any phase. The system must recover gracefully.

                    ┌──────────┐
                    │  START   │
                    └────┬─────┘
                         │
                    ┌────▼─────┐
                    │ /office- │
                    │  hours   │
                    └────┬─────┘
                         │
                    ┌────▼─────┐
                    │/autoplan │ ◄── single approval gate
                    └────┬─────┘
                         │
              ┌──────────▼──────────┐
              │       BUILD         │ ◄── /checkpoint auto-save
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │      /health        │ ◄── quality gate
              │   (score >= 7.0)    │
              └──────────┬──────────┘
                         │ fail → back to BUILD
              ┌──────────▼──────────┐
              │      /review        │
              └──────────┬──────────┘
                         │ ASK items → back to BUILD
              ┌──────────▼──────────┐
              │        /qa          │
              └──────────┬──────────┘
                         │ bugs found → back to BUILD
              ┌──────────▼──────────┐
              │       /ship         │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │ /checkpoint archive │ ◄── preserve, don't destroy
              └─────────────────────┘
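The transitions in the diagram can be written down as a small table plus three gates. This is a sketch under stated assumptions: the phase names, the result fields (`score`, `ask_items`, `bugs`), and the timeline event shape are hypothetical, plan rejection at the approval gate is not modeled, and resuming is interpreted here as continuing with the phase that follows the last completed one.

```python
HEALTH_GATE = 7.0  # threshold shown in the diagram

# Phases with a single unconditional successor.
LINEAR = {
    "office-hours": "autoplan",
    "autoplan": "build",
    "build": "health",
    "ship": "archive",
    "archive": None,  # terminal: the checkpoint is archived, not deleted
}
KNOWN_PHASES = set(LINEAR) | {"health", "review", "qa"}


def next_phase(phase, result=None):
    """Return the phase that follows `phase`, given that phase's result."""
    result = result or {}
    if phase == "health":
        return "review" if result.get("score", 0.0) >= HEALTH_GATE else "build"
    if phase == "review":
        return "build" if result.get("ask_items") else "qa"
    if phase == "qa":
        return "build" if result.get("bugs") else "ship"
    return LINEAR[phase]


def resume_phase(timeline):
    """After compaction, find where to continue from the timeline events."""
    for event in reversed(timeline):
        if event.get("event") == "completed" and event.get("phase") in KNOWN_PHASES:
            return next_phase(event["phase"], event.get("result"))
    return "office-hours"  # nothing completed yet: start from the top
```

Because every gate can route back to `build`, the run is a loop rather than a straight line, which is why each phase needs to record its result.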

What ships:

/autoship autonomous pipeline with the state machine above. Each phase writes to timeline.jsonl. Checkpoints auto-save before each phase. Compaction recovery: context recovery reads checkpoint + timeline, resumes at the last completed phase.
Checkpoint archival on completion (not deletion). Recovery state is preserved for debugging failed autoship runs.
/ideate brainstorming skill (parallel divergent agents + adversarial filtering)
Research agents in /plan-eng-review (codebase analyst, history analyst, best practices researcher, learnings researcher)

Depends on: R1 (learnings for research agents), R2 (review army for quality), R3 (session intelligence for persistence), R4 (adaptive ceremony for speed).

#Release 6: "Execution Studio" — NOT YET SHIPPED

Headline: Parallel execution infrastructure.

What ships:

Swarm orchestration: multi-worktree parallel builds. Builds on Conductor workspace handoff from /checkpoint (R3). An orchestrator skill dispatches independent workstreams to parallel agents, each with its own worktree.
Codex build delegation: auto-detect when to delegate implementation to Codex CLI based on task type (boilerplate, test generation, mechanical refactors).
PR feedback resolution: parallel comment resolver across review platforms.
/onboard: auto-generated contributor guide from codebase analysis.
/triage-prs: batch PR triage for maintainers.

#Release 7: "Design & Media" — NOT YET SHIPPED

Headline: Visual design integration.

What ships:

Figma design sync (pixel-matching iteration loop)
Feature video recording (auto-generated PR demos)
Cross-platform portability (Copilot, Kiro, Windsurf output)

#Risk Register

#Proxy signals as permission to skip scrutiny

(Identified by Codex review, 2026-04-01)

/health scores, clean review history, and timeline patterns are useful signals. They are not proof of safety. If those signals feed ceremony reduction AND /autoship, the failure mode is rare, silent, high-severity mistakes. Mitigations:

Certain change classes never fast-track (migrations, auth, infra, new endpoints).
Trust degrades gradually rather than resetting: a regression lowers trust for that change class by one level.
/autoship always runs FULL ceremony on its first run per project. Trust is earned.

#Stale context recovery

(Identified by Codex review, 2026-04-01)

Context recovery can inject wrong-branch state, obsolete plans, or invalid checkpoints. Mitigations:

Checkpoints include branch name in YAML frontmatter. Context recovery filters by current branch.
Timeline grep filters by branch before showing LAST_SESSION.
Stale artifact detection: if a checkpoint is more than 7 days old, flag it as potentially stale rather than presenting it as current.
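The branch filter and the 7-day staleness check might combine as follows. The sketch assumes the checkpoint frontmatter has already been parsed into dicts with a `branch` string and a timezone-aware `saved_at` datetime; both field names and `pick_checkpoint` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)


def pick_checkpoint(checkpoints, current_branch, now):
    """Return the newest checkpoint for this branch, flagged if stale."""
    candidates = [c for c in checkpoints if c["branch"] == current_branch]
    if not candidates:
        return None  # never fall back to another branch's state
    latest = max(candidates, key=lambda c: c["saved_at"])
    stale = (now - latest["saved_at"]) > STALE_AFTER
    return {**latest, "stale": stale}
```

Returning a `stale` flag instead of dropping the checkpoint lets the preamble still show it, labeled as potentially out of date.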

#Validation metrics needed

(Identified by Codex review, 2026-04-01)

Before shipping R4 (Adaptive Ceremony), measure:

Predictive suggestion accuracy (did the user run the suggested skill?)
Trust policy false-skip rate (did fast-tracked PRs have post-merge issues?)
Context recovery accuracy (did recovered context match actual state?)
/health score correlation with actual code quality (do high scores predict fewer production bugs?)

These metrics should be collected during R3 usage and reviewed before R4 ships.

#Acknowledged Inspiration

The self-learning roadmap was inspired by ideas from the Compound Engineering project by Nico Bailon. Their exploration of learnings persistence, parallel review agents, and autonomous pipelines catalyzed the design of GStack's approach. We adapted every concept to fit GStack's template system, voice, and architecture rather than porting directly.