Why each phase exists, and how it fails
references/process.md, rendered directly from the skill doctrine.
This is the reasoning behind the gated checklist in SKILL.md. Read it once at the
start of a fresh engagement. SKILL.md says what to do in each phase and where to
stop; this file says why each move earns its place and how it tends to fail when
you skip it. The timeless structure it all enacts — the funnel and its five
contractions, the fork as the unit, the four axes, the shared Scenario — lives in
model.md; this file does not restate it. The two worked engagements (EFI, Skischule)
that show these phases run end to end live in examples.md; this file points at the
moves they demonstrate rather than re-narrating them.
The fundamental move that recurs through every phase: generate candidates →
eliminate via trade-off analysis. Scoring is the discipline that makes elimination
defensible, and it runs continuously across the loop, not just at the end. Each phase
below is one contraction of the funnel (model.md); here is what that contraction buys
you and what breaks when it’s done badly.
Phase 1 — Mission: why it bounds everything
The first contraction. You cannot write good requirements until you know what work the system is for, so Phase 1 names the mission — its purpose, the use cases, and the system that actually exists — stated tool-neutrally, before any requirement is written. It bounds everything after it, which is exactly why a wrong move here is the most expensive: a mis-framed mission silently pre-selects an architecture three phases before the space is supposed to open.
The moves that earn the phase:
- Stakeholder mapping. Who decides, who uses it daily, who is the sceptic. Lead-user comfort is a constraint, not a preference — naming the non-technical lead user here is what later lets approachability sit as a critical criterion rather than a nice-to-have.
- The stress-test use case. One concrete, near-future, high-stakes scenario the system must serve, named now so every later evaluation can be walked through the same scenario. It is the spine of the Scenario (model.md) and the thread that keeps Phase 4 honest.
- Use cases, tool-neutrally. State what the system must do (“track follow-ups with reminders”), never the product category (“a CRM”). The test: can you state it without naming a product category? If not, reframe it. Naming a tool type here pre-selects the architecture before Phase 3 opens the space, and that leak carries into the requirements at every tier. The canonical term is use case / Anwendungsfall, not “job/Aufgabe” (a label that leaked in from early cockpits); the cockpit ships them as mission.json useCases[].
- Ground it in the real system. When prior documents are dropped in (specs, interview notes, an earlier AI-drafted requirements doc), treat them as untrusted, possibly contradictory, possibly stale. They drift, they’re incomplete, and AI-generated ones are often confidently wrong. Trust raw sources (transcripts, on-site notes) over synthesized ones, surface every contradiction rather than silently picking a version, and validate each material assumption with the user before it becomes a requirement. Build the picture from what’s confirmed true, marking the rest to-verify.
How it fails: stating the mission in tool categories (“we need a CRM and a ticketing system”), which hands Phase 3 its answer and collapses the space before it was ever examined; or inheriting an AI-drafted doc’s framing as ground truth, so the whole engagement re-derives a confident error.
Phase 2 — Requirements: why it must be self-sufficient
Phase 2 turns the mission into a requirements document well-scoped enough that a
landscape-survey brief can be written from it with no further customer input. That
self-sufficiency is the whole point: the requirements are the columns of a form
(model.md), and the form has to be complete before research goes out to fill it. This
is the step that earns the entire engagement — every downstream phase consumes it — so a
gap here propagates everywhere.
The requirements sort into tiers — hard filters (binary pass/fail, any failure
eliminates), critical preferences (decision-drivers), standard, nice-to-have — and each
becomes a criterion placed on exactly one axis. Functional requirements, non-functional
requirements, risks, and constraints all become criteria; the single-placement rule
(one criterion, one axis) is decided here, the cheapest place to catch a downside
masquerading as a capability. The tiers, the weights, and the requirements-vocabulary →
axis mapping are owned by model.md and contract.md; this phase’s job is to apply them
cleanly and to quote the customer’s stated priorities verbatim so they carry forward
without drift.
Two moves matter beyond filling the tiers:
- Name what is not a requirement. Saying “EU data residency is not required” out loud eliminates an entire false-disqualification axis before research ever runs.
- Sketch the demand picture, don’t pin it. A rough Mengengerüst (quantity structure: prose plus target amounts, no schema, no named knobs) captures the shape of demand. The actual Scenario parameters get pinned in the Phase-4 pricing pass, once candidate solutions reveal which cost shape each needs. Committing to seats/peaks/toggles now imports a specific (usually SaaS-shaped) cost model before any solution exists to cost.
How it fails: mistaking a preference for a hard filter (or asserting a hard filter
that was never actually a constraint for this user), which silently disqualifies good
candidates; preferences that never differentiate (every tool scores 4–5) and so are dead
weight — sharpen the rubric or remove them; a compound filter that bundles several
requirements into one (“must be their whole web presence” hiding discovery + trust +
booking), which lets a tool pass by clearing only one; and carrying parallel
“non-functional” or “risks” lists alongside the criteria, which is exactly the
bookkeeping single-placement replaces. The conversational instrument that elicits all of
this — the recurring-requirements checklist, the holistic-then-criterion-by-criterion
sign-off — is owned by requirements-interview.md.
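The tier logic and two of the failure modes above can be sketched in a few lines. This is only an illustration under stated assumptions, not the contract.md schema: the candidate names, the criterion keys, and the `min_spread` threshold are all made up for the example.

```python
# Hypothetical data: names, keys, and scores are illustrative only.
hard_filters = ["runs_offline", "exports_csv"]
preferences = ["approachability", "reporting"]

candidates = {
    "tool_a": {"runs_offline": True, "exports_csv": True, "approachability": 5, "reporting": 4},
    "tool_b": {"runs_offline": True, "exports_csv": False, "approachability": 4, "reporting": 5},
    "tool_c": {"runs_offline": True, "exports_csv": True, "approachability": 2, "reporting": 4},
}


def apply_hard_filters(candidates, hard_filters):
    """Split candidates into survivors and recorded eliminations.

    Any single failed hard filter eliminates; the failed filters are kept
    as the reason, so the elimination stays in the record.
    """
    survivors, eliminated = {}, {}
    for name, scores in candidates.items():
        failed = [f for f in hard_filters if not scores[f]]
        if failed:
            eliminated[name] = failed
        else:
            survivors[name] = scores
    return survivors, eliminated


def dead_weight(survivors, preferences, min_spread=2):
    """Preferences whose 1-5 scores barely differ across the survivors."""
    flat = []
    for p in preferences:
        values = [s[p] for s in survivors.values()]
        if max(values) - min(values) < min_spread:
            flat.append(p)
    return flat


survivors, eliminated = apply_hard_filters(candidates, hard_filters)
flat_preferences = dead_weight(survivors, preferences)
```

Here `tool_b` is eliminated on `exports_csv`, and `reporting` is flagged because both survivors score 4 on it. A flagged preference is a prompt to sharpen the rubric or drop the criterion, not an automatic removal.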
Phase 3 — Architecture: why it precedes the market
Phase 3 partitions the solution space into a few coherent architectures and surfaces the
forks (decision points) whose positions generate them — the structural findings that
determine which architectures are viable before any market data is gathered. It runs in
two movements: the systems analysis that yields the forks, then the candidate architectures
drawn from them. (For a pure single-tool pick the forks are light or absent; the phase
shrinks accordingly.) The fork-as-unit, the resolution modes, and the
architecture-is-not-a-solution rule are owned by model.md; here is why each output is
worth producing.
- The coupling finding is the most valuable output. It tells you which entities must share relational state for the core workflows to function. If a workflow must traverse several related entities in one query, a solution that stores them across separate tools with async sync is structurally fragile — it breaks lookup, prefill, and triggers at the seams even if each individual tool is excellent. Naming this eliminates architecture classes, not just tools, before the landscape survey runs. Name which entities form the spine (must share one store) and which are loose.
- Key workflows. The 2–4 highest-stakes flows through the domain, including the Phase-1 stress-test scenario, with the data traversed at each step and the human-in-the-loop / AI-insertion points. These double as the verification guide for the next client meeting: every unverified assumption is a question to ask.
- The decision forks are the durable output. Name the degrees of freedom whose positions generate the architectures: who owns the system of record, whether a knowledge layer is native or separate, whether a core component is rented or owned. Each fork carries a resolution mode that names which later phase closes it (fact at 2, rating at 4, value at 5) and a leverage (how much it narrows). Forks are discovered throughout the engagement (Phase 4 surfaces more), but here is where you first go looking; place each later-found one at the phase that resolves it. Discovery ≠ placement (the rule is model.md’s; the consequence is this phase’s: architecture leads with the architectures and names the forks, but does not parade their interactive lever, because the space must come before the lever that reshapes it).
The second movement enumerates the architectures — tool-agnostic forms, each a coherent path through the forks, drawn before any market scan. An architecture is not yet a solution: the loop (Phase 4) fills it with concrete tools + build + glue and scores it into a solution; a custom-built piece is a first-class component there, not a fallback. Include anchor candidates — architectures expected to fail, kept to verify a structural finding rather than to compete (the fragmented best-of-breed to test whether your coupling analysis really rules it out; the full custom build to mark the build ceiling). Mark them as anchors so they don’t distort the shortlist; the elimination itself is evidence. Enumerate the architectures fresh for this engagement, fold in the researcher’s hunch as one branch but generate the alternatives alongside it, and note which you expect to fall and why — how a composition fails is often the most useful structural finding.
How it fails: using the phase to relabel Phase 2 criteria as component names (“follow-up management” → “Task/Reminder Module”) — a 1:1 mapping of use cases to modules is circular, it adds no topology the criteria didn’t already have. If the output could have been produced by scanning the criteria linearly, the phase hasn’t done its job. The deeper failure is skipping the phase: done implicitly, an unexamined architectural assumption becomes the wall you hit on iteration three, and broadening turns into a panic rescue instead of a planned move. Label everything HYPOTHESIS until the client meeting verifies it, and keep structural findings falsifiable (“a fragmented spine will break relational lookup” is testable; “complex systems need tight integration” is not).
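The coupling finding lends itself to a small mechanical check: list the entities each key workflow traverses, map each entity to the store that would hold it under a given architecture, and see which workflows cross a seam. The sketch below assumes invented workflow, entity, and store names; its output is a HYPOTHESIS to verify with the client, not a verdict.

```python
# Hypothetical workflows: each lists the entities it traverses in one pass.
workflows = {
    "renewal_follow_up": ["customer", "contract", "task"],
    "newsletter_send": ["contact_list"],
}

# Hypothetical architectures: which store would hold each entity.
architectures = {
    "single_spine": {
        "customer": "core", "contract": "core", "task": "core",
        "contact_list": "mailer",
    },
    "best_of_breed": {
        "customer": "crm", "contract": "docs", "task": "todo",
        "contact_list": "mailer",
    },
}


def stores_crossed(entities, placement):
    """Distinct stores a workflow touches under one architecture."""
    return sorted({placement[e] for e in entities})


def fragile_workflows(workflows, placement):
    """Workflows whose entities are split across more than one store."""
    result = {}
    for name, entities in workflows.items():
        stores = stores_crossed(entities, placement)
        if len(stores) > 1:
            result[name] = stores
    return result


report = {
    arch: fragile_workflows(workflows, placement)
    for arch, placement in architectures.items()
}
```

In this example the fragmented `best_of_breed` anchor splits `renewal_follow_up` across three stores while `single_spine` keeps it in one, which is the falsifiable form of "a fragmented spine will break relational lookup".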
Phase 4 — Solutions: why the loop, and how it narrows
The core of the engagement — where architectures get filled with tools and scored into solutions
(model.md: the composition is the scored unit), and where further forks surface (a
cost or capability found here can open a decision Phase 3 couldn’t see). One iteration is
landscape survey → (deep dive + red team) → clarification when needed; it exits with
either a clear winner (→ Decisions) or no winner (→ broaden, → another iteration). Most
engagements need only one iteration. Each stage is a separate turn because each is a
research hand-off (the hand-off discipline and the brief skeletons are owned by
research-briefs.md); the loop narrows a shortlist, it does not crown a winner early.
Narrowing is a palette of moves, not a fixed pipeline. Survey → deep dive → red team
names the activities; it does not prescribe their order once the long list exists. After
the survey you hold a palette of investigative moves — specify a candidate’s composition
(its coverage), estimate its cost, estimate its build/integration effort, assess its risk,
assess its fit — and at each step you run the one with the most leverage right now, the
one that most reduces uncertainty about which solutions are real. Some candidates fall away
by a failed gate, some by a consultant judgment call; there is no canonical order
(sometimes one risk finding kills three candidates before any costing is worth doing). The
mechanics of the moves and fork-discovery during narrowing are in model.md § Narrowing.
The loop is operational: the gaps are the work-list. Build the long list per
architecture and initialise each candidate empty: all five facets offen (open), no scores
trusted yet. Which move to run next is then not a matter of feel: it is the open facet whose
closing would most separate the live candidates, read off the knowledge grid by
information gain (model.md § The knowledge grid). So the iteration is concretely init
empty → fill the highest-gain open cell → re-rank → collapse, repeated until the survivors
differ only on a handful of forks (the collapse shape Phase 5 wants). A gap that turns out to
hide a value-fork becomes a 🎯 client move, not a research one. The Arbeitsbrett (the
per-architecture board) and the Schärfe move-rail that render this grid are cockpit.md; the
infoGain compute stays template-level until it generalizes across engagements.
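As a rough illustration of "init empty, fill the highest-gain open cell", the sketch below uses a deliberately crude stand-in for information gain: the number of live candidates on which a facet is still open, multiplied by a consultant-supplied guess at how strongly that facet separates candidates. The real infoGain compute is not specified here, so the `separating_power` numbers, the `"closed"` state name, and the facet keys are assumptions.

```python
# The five facets, named after the palette of moves; the keys are illustrative.
FACETS = ["composition", "cost", "effort", "risk", "fit"]


def init_empty(candidates):
    """Every candidate starts with all five facets open ("offen")."""
    return {c: {f: "offen" for f in FACETS} for c in candidates}


def next_move(grid, separating_power):
    """Pick the open facet with the highest crude gain proxy.

    gain = (number of live candidates with the facet open)
           * (assumed separating power of that facet)
    Returns None when no facet is open anywhere.
    """
    best, best_gain = None, 0.0
    for facet in FACETS:
        open_cells = sum(1 for cells in grid.values() if cells[facet] == "offen")
        gain = open_cells * separating_power.get(facet, 0.0)
        if gain > best_gain:
            best, best_gain = facet, gain
    return best


grid = init_empty(["solution_a", "solution_b", "solution_c"])
grid["solution_a"]["risk"] = "closed"  # one risk cell already researched

# Assumed, consultant-estimated separating power per facet (0..1).
power = {"composition": 0.2, "cost": 0.5, "effort": 0.3, "risk": 0.9, "fit": 0.4}
move = next_move(grid, power)
```

With these numbers the next move is a risk assessment, even though one risk cell is already closed, because risk is assumed to separate the candidates far more than the other facets.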
Bookend every research workflow with sign-off, and keep the raw. Before a dynamic research
workflow runs, present its plan — its structure and how many agents it fans out to — for
approval; when it returns, present the results for approval before any write-back (the
bookend doctrine is model.md § The sign-off bookend; the brief hand-off and the ingest/audit
procedure are research-briefs.md). Write-back may move scores, knowledge-states, cost, and
findings but never gates or authored verdict prose, and a tool research found weaker
becomes a red-team caveat + lower score, never a fabricated gate-0. Persist every round’s
raw output verbatim to research/ the instant it returns (it often sits in ephemeral
scratch that’s lost on recycle) — the structured matrix and the curated findings are
distillations, not replacements. Any throwaway script written to wrangle a round is scratch:
keep the raw research, discard the scripts.
The stages, and what each is for
- Landscape survey. From a universe of 30–50+ candidates per architecture branch, produce a shortlist of 5–10 that pass the hard filters, plus a structural finding about the category, plus the explicit list of who was eliminated and on which filter. A branch’s core is a family of interchangeable products, not one tool — the shortlist is that family catalogued. No ranking, no winner, no preference scores — that’s the deep dive. In a solution-design engagement the survey’s real yield is architectural: the per-category finding feeds back to the Phase-3 forks and architectures, confirming, weakening, killing — or revealing a position the Phase-3 set missed (a market gap is itself a finding). The architecture set is provisional both ways: a survey can demote a branch and surface a new one. A position the survey adds or promotes earns a first-class survey at the same depth — never carried forward on the incidental mentions from the brief that happened to find it. And spend survey depth in proportion to a branch’s current live-ness, not its round-1 prominence: the first pass is a bet, so keep it cheap enough to re-bet, then rebalance depth toward the branches that survived. Closing the architecture funnel is the point; the narrowed tool list is a by-product. (Ehimare: R1 spent its deepest survey on the broker-CRM branch R2 then demoted, while the generic-CRM position that became the favourite was never enumerated in Phase 3 and stayed under-surveyed for two rounds.)
- Deep dive. Head-to-head feature evaluation of the shortlist against the stress-test use case, walking every candidate through the same scenarios. It populates the Fit axis and picks the paper-winner.
- Red team — run in parallel with the deep dive, not after. Framed as “what does feature-comparison miss?”: recent negative reviews (last 12–18 months), acquisition or leadership turmoil, pricing traps (geographic exclusions, tier paywalls hiding required features), hidden disqualifiers, stability signals, real user pain in forums. Most of this lands on the Risk axis. The deep dive picks the paper-winner; the red team determines whether the paper-winner survives contact with reality — and the decisive findings often come from here, not the feature comparison.
- Roll tools up to solutions. Tool scores are inputs; the unit you rank is the solution (composition). After the deep dive, assess each candidate solution as a whole — its Fit (does the composition cover the use cases, including the custom piece?), its Risk (integration fragility and vendor-count/dependency risk grow with every tool and every line of glue), its Cost. A two-tool stack can out-fit any single platform and still lose on integration risk; that trade-off only shows up at the solution level, so score it there.
- Pricing / TCO runs as its own pass after scoring. You need costs before you can recommend, but cost must never contaminate the Fit score, so pricing runs after Fit is fixed, as a separate axis. The 4-bucket method and its reasoning are owned by pricing-tco.md; the schema fields by contract.md. The protected rule lives here too: Fit is scored with no knowledge of price.
- Clarification: only when there’s no clear winner (two or three finalists within ~5% on the matrix, and the 1–5 scale can’t separate them). Re-score all the tied finalists symmetrically on the full rubric (including any criterion discovered since Phase 2, back-scored across every candidate) rather than building the case for the current leader. Run the head-to-head on the gap (one persona, one workflow), but let the symmetric re-score, not a quiet qualitative tiebreak, move the order. Some of what separates finalists is genuinely experiential (which UI feels right) and belongs to the user trying them, not to your matrix; surface those rather than scoring them. Never re-litigate settled questions.
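The clarification trigger can be made explicit. The sketch below reads "within ~5% on the matrix" as "total within 5% of the leader's total", which is one possible interpretation; the solution names and totals are invented.

```python
def finalists_within(totals, tolerance=0.05):
    """Solutions whose matrix total is within `tolerance` of the leader's."""
    top = max(totals.values())
    threshold = top * (1 - tolerance)
    return sorted(name for name, total in totals.items() if total >= threshold)


def needs_clarification(totals, tolerance=0.05):
    """True when two or more finalists are too close to call on the matrix."""
    return len(finalists_within(totals, tolerance)) >= 2


# Hypothetical weighted totals on the 1-5 scale.
totals = {"solution_a": 4.10, "solution_b": 3.95, "solution_c": 3.20}
tied = finalists_within(totals)
```

Here `solution_a` and `solution_b` are tied and `solution_c` is not, so the symmetric re-score would cover exactly those two finalists.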
Eliminations are recorded, never deleted. Every eliminated candidate stays in the
record with the reason it fell — in the matrix as a hard-filter 0, or in the loop log.
Eliminations get revisited when criteria shift, so a struck-through candidate is data, not
absence — it’s the first place a later round looks. This governs the research output
too: when a landscape report hands you a ranked list or a named winner, discard the
report ranking and treat the shortlist as unranked until the deep dive scores it (the
report’s job is to find and filter, not to decide; the ingest/audit procedure that enforces
this is owned by research-briefs.md).
How it fails: doing several stages in one breath (survey + score + recommend), which produces confident mush, the failure mode the whole skill exists to prevent; anointing a #1 in the first pass instead of leaving a 3–5 candidate shortlist; letting the research report’s ranking stand; or scoring tools instead of solutions, so the integration/vendor-count trade-off never surfaces.
Stamp what moved, flag what now rests on it (the judgement hierarchy)
A Phase-4 finding doesn’t just update a score — it can move a standing judgement one or
two layers up (model.md § The judgement hierarchy). So when a finding lands and
materially changes a child — a gate flip, an elimination, a score crossing a
tier boundary, or a new finding tagged to a dependency:
- Stamp the child’s revisedIn with the current round, on the touched solution (solutions*.json) and on the finding (rounds[].findings[].revisedIn, research-index.json). Cosmetic edits don’t get stamped (or the badge cries wolf).
- Any architecture or comparison verdict whose verdict.basis[] names that child now reads stale: the cockpit shows “⚠ neu zu prüfen — <dep> hat sich in Runde N geändert” (roughly “to be re-checked: <dep> changed in round N”). Kernel staleness() detects it; nothing is rewritten.
- Re-examine and re-stamp deliberately. A human/agent re-forms the standing verdict and bumps its verdict.asOf to the current round (clearing the badge), or confirms it still holds and re-stamps anyway. The badge is a prompt to look, never an auto-edit.
This is the in-phase, append-not-rewind half of regression; the distinct escalation half (a finding that invalidates a requirement) is below.
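The stamp-and-flag protocol reduces to a comparison between a child's revisedIn and the verdict's asOf. The following is a minimal sketch of that comparison, not the kernel's actual staleness() implementation; it assumes rounds are plain integers and that children are looked up by the same keys that appear in basis[].

```python
def stale_dependencies(verdict, children):
    """Basis entries that were revised after the verdict was last stamped.

    Detection only: nothing is rewritten. Refs that match no known child
    are skipped here, so they never raise the badge.
    """
    return [
        dep
        for dep in verdict["basis"]
        if dep in children and children[dep].get("revisedIn", 0) > verdict["asOf"]
    ]


# Hypothetical children (a solution and a finding) with their last revision round.
children = {
    "solution_a": {"revisedIn": 3},
    "finding_17": {"revisedIn": 1},
}
verdict = {"basis": ["solution_a", "finding_17"], "asOf": 2}

before = stale_dependencies(verdict, children)

# A human/agent re-examines the verdict and re-stamps it deliberately.
verdict["asOf"] = 3
after = stale_dependencies(verdict, children)
```

Before the re-stamp the verdict is flagged because `solution_a` moved in round 3; after bumping `asOf` the badge clears. The bump is an explicit step in the example, mirroring the rule that the badge prompts a look and never edits anything itself.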
Phase 5 — Decisions: why 2–3 solutions, not one
Collapse the accumulated research into a recommendation the customer can act on, and
close the open forks: the 🎯 value-forks return to the client here (📊 rating-forks
were settled by Phase 4’s scoring, 🔍 fact-forks by a Phase-2 datum). Don’t present one
answer — present 2–3 solutions that lead to different outcomes (typically a recommended
path, a compromise path, and a minimal-change path), each with its composition, an
architecture sketch, the 4-bucket TCO band (not a single sticker price — pricing-tco.md),
implementation path, what it optimises for, what it trades off. Present the comparison
across the four axes — Fit and Risk scored, Cost as a band whose width reads as cost risk —
never folded into one blended number.
The shape to aim for is 2–3 solutions + 2–3 open decisions, reached by collapsing the
survivors once they differ only on a handful of forks (their fork positions, model.md):
siblings that vary on a single decision become “one solution family + that open decision,”
which is exactly the briefing shape. If too many axes still separate them to present
cleanly, that’s the signal to run another move or another round — not to force the pick.
The instrument informs; the consultant authors the verdict (P1 below). The weighted ranking and its named weighting presets are a transparent, turnable instrument — show the client your weighting and let them watch the order move — but the verdict is yours to write, framed against the customer’s stated values (the verbatim Phase-2 quotes), risks named not glossed, and free to diverge from whatever currently sits on top; when it does, say so and say why. The old rule “never blend into one score” survives in its true form: never let an opaque blend stand in for your judgment.
How it fails: presenting one answer as if research dictated it (it informs, it doesn’t dictate); folding the four axes into a single ranked number that hides the trade-off; forcing a pick when the survivors still differ on too many axes; or letting the instrument’s top row be the verdict instead of authoring one.
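To show why the weighted ranking is an instrument and not the verdict, the sketch below ranks two invented solutions under two invented weighting presets. Assumptions: only Fit and Risk are weighted, a higher Risk score means safer, and Cost stays outside the ranking as its own band.

```python
# Hypothetical named presets; weights sum to 1.
presets = {
    "fit_first": {"fit": 0.8, "risk": 0.2},
    "risk_averse": {"fit": 0.3, "risk": 0.7},
}

# Hypothetical axis scores on the 1-5 scale (higher risk score = safer).
solutions = {
    "solution_a": {"fit": 4.6, "risk": 2.8},
    "solution_b": {"fit": 3.9, "risk": 4.1},
}


def ranking(solutions, weights):
    """Order solutions by weighted total, best first."""
    totals = {
        name: sum(weights[axis] * scores[axis] for axis in weights)
        for name, scores in solutions.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


orders = {name: ranking(solutions, weights) for name, weights in presets.items()}
```

The order flips between the two presets. That flip is the thing to show the client: the top row depends on the weighting, so the written verdict has to say which values it was framed against.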
Regression — absorbing change without rewinding
The funnel is re-entrant. Priorities shift, a new requirement surfaces, a stakeholder reframes the mission — and the method must take it in without discarding what it learned. The discipline:
- Append, don’t rewind. A new investigation is a new research round, logged in research-index.json rounds[] (what it asked, what it found, how it moved the picture). The log only grows; you never rewrite an old round.
- Update the judgments in place. A new criterion is added to criteria.json and back-scored across every surviving candidate; an unscored criterion that never touches the totals is a rescore that changes nothing. A shifted priority is a re-weight. Eliminations are recorded as a hard-filter 0, never deleted.
- The live surfaces show the latest; the Report is the history. Every live surface renders current truth: the Reise stations the current mission/criteria/architecture, the Exploration screens (Arbeitsbrett, Verdikt, Cockpit) the current scores, shortlist, and fork positions. The Report screen is the phase-by-phase log of how the engagement learned; history lives in the Report, current truth lives in the live surfaces. (The rounds[] data history and the optional loop-log.md prose nav aid are distinct granularities; see contract.md.)
- Bubble the change up the judgement hierarchy. A regression that moves a child stamps its revisedIn, which flags every standing architecture/comparison verdict resting on it (basis[]) as neu zu prüfen (“to be re-checked”), the soft, in-phase signal that a higher-layer judgement may no longer hold (the stamp-and-flag protocol is in § Phase 4 above; the detector is kernel staleness()). This is detection, not re-derivation: the human re-examines and re-stamps verdict.asOf. Footgun: if you rename a solution key or finding id, fix every basis[] that referenced it; an orphaned ref silently stops matching and the verdict looks fresh forever (contract/validate.mjs warns).
- A research-initiated requirement change is escalated, not silently absorbed. Re-deriving forward applies when a later-phase finding moves a fork, a rating, or a candidate. But when a finding invalidates or materially changes an earlier-phase requirement (Phase 1 use case or Phase 2 criterion), e.g. a survey shows a stated ambition is unbuildable or carries unacceptable liability, STOP and escalate to the consultant/user before re-deriving. A requirement change is theirs to confirm, exactly like a value-fork; the agent may propose it (with the finding that forced it) but never quietly rewrite the requirement and recompute. (Ehimare R2: a survey finding that customer-facing AVB-citing carries hallucination liability changed the requirement; it was reframed as an internal-only helper, a requirement move that needed sign-off, not an absorb.)
So a regression doesn’t reset the engagement to an earlier phase; it adds material at whatever phase the change touches (a new requirement re-enters Phase 2, a new architecture re-enters Phase 3) and re-derives forward. Because the phases are data files, not code branches, re-entering one is editing that file and letting the rest recompute.
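Three of the regression rules are mechanical enough to sketch: the round log only grows, a new criterion must be back-scored everywhere, and every basis[] ref must still resolve. The field names inside a round (`asked`, `found`, `moved`) and all example keys are assumptions for illustration, not the contract.md schema, and the orphan check is a stand-in for whatever contract/validate.mjs actually does.

```python
def append_round(rounds, asked, found, moved):
    """Return a new log with one more round; existing rounds are untouched."""
    entry = {"round": len(rounds) + 1, "asked": asked, "found": found, "moved": moved}
    return rounds + [entry]


def unscored(criteria, scores):
    """(candidate, criterion) pairs that still lack a score after a criterion was added."""
    return [
        (candidate, criterion)
        for candidate, row in scores.items()
        for criterion in criteria
        if criterion not in row
    ]


def orphaned_refs(verdict, known_keys):
    """basis[] refs that no longer match any solution key or finding id."""
    return [ref for ref in verdict["basis"] if ref not in known_keys]


rounds = append_round([], "Which CRMs pass the filters?", "7 survivors", "shortlist formed")
rounds = append_round(rounds, "Audit trail support?", "2 lack it", "new criterion added")

criteria = ["approachability", "reporting", "audit_trail"]  # audit_trail is new
scores = {
    "solution_a": {"approachability": 4, "reporting": 3, "audit_trail": 5},
    "solution_b": {"approachability": 5, "reporting": 4},  # not yet back-scored
}
gaps = unscored(criteria, scores)

orphans = orphaned_refs({"basis": ["solution_a", "solution_x"]}, set(scores))
```

The gap list is the back-scoring work still owed before the totals mean anything, and a non-empty orphan list marks a verdict that would otherwise look fresh forever.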
Broadening is the within-Phase-4 special case of regression. When an iteration exits without a winner — no architecture produced an acceptable candidate, or every finalist dies on the same dimension on the same axis — the architectural assumption is wrong, not the candidates. “Stuck is information.” Broaden along three streams: gap-check (did a hard filter over-eliminate?), adjacent category (a market not surveyed?), alternate architecture (invert the assumption). Propose the direction, stop for sign-off, then start the next iteration with a fresh landscape survey on the broadened scope. Broadening is triggered by a dead loop and stays inside Phase 4; regression is the general case, triggered by changed inputs at any phase. Same mechanic (append, re-derive), different trigger.
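One of the two broadening triggers, every finalist dying on the same dimension on the same axis, can be checked directly from the recorded eliminations. The record shape below (`axis`, `dimension`) and the example values are assumptions for illustration.

```python
def shared_cause(eliminations):
    """Return the single (axis, dimension) all finalists fell on, else None."""
    causes = {(e["axis"], e["dimension"]) for e in eliminations.values()}
    return next(iter(causes)) if len(causes) == 1 else None


# Hypothetical finalists that all fell in the same iteration.
same = {
    "solution_a": {"axis": "risk", "dimension": "vendor_lock_in"},
    "solution_b": {"axis": "risk", "dimension": "vendor_lock_in"},
}
mixed = {
    "solution_a": {"axis": "risk", "dimension": "vendor_lock_in"},
    "solution_b": {"axis": "fit", "dimension": "offline_use"},
}

broaden_signal = shared_cause(same)
no_signal = shared_cause(mixed)
```

A non-None result points at the architectural assumption to question; the direction of the broadening is still proposed to the human and waits for sign-off.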
The five Principles
- Architecture before tools. The first decision is the shape of the stack, not the products. Phase 3 exists to make that decision before the market scan can bias it.
- Generate then eliminate. Every iteration of the loop produces candidates and eliminates via trade-off analysis. Scoring is what makes the elimination defensible.
- Stuck is information. When the loop hits a wall, the architectural assumption is wrong — broaden, don’t grind on the same path.
- Stakeholder comfort is a decisive criterion. Tool approachability for the actual lead users is weighted alongside power-user features, not below them.
- Scoring drift is signal. When the quantitative ranking and the qualitative read diverge — the scores rank A first but the verdict prose names B — investigate the underlying assumptions before reconciling. Don’t silently smooth it over.
The phase-gating discipline that holds across all five
After every phase: STOP, present what you produced, state the decision the human needs to
make, and wait for sign-off before the next. This is not a single-shot task — it is a
staged engagement across many turns and usually several chats, and the single most common
failure mode is an agent doing several phases at once. The stop is the discipline that makes
the output trustworthy; if you ever notice yourself about to “also go ahead and…,” that’s
the failure mode, and you stop there. (The same rule applied inside a step is P1’s verdict
doctrine, owned by model.md: the consultant authors the verdict, the agent presents the
picture and the defects it found, and never eliminates a candidate or names a winner on its
own.)