Methodik

Why each phase exists, and how it fails

references/process.md — direkt aus der Skill-Doktrin gerendert.

This is the reasoning behind the gated checklist in SKILL.md. Read it once at the start of a fresh engagement. SKILL.md says what to do in each phase and where to stop; this file says why each move earns its place and how it tends to fail when you skip it. The timeless structure it all enacts — the funnel and its five contractions, the fork as the unit, the four axes, the shared Scenario — lives in model.md; this file does not restate it. The two worked engagements (EFI, Skischule) that show these phases run end to end live in examples.md; this file points at the moves they demonstrate rather than re-narrating them.

The fundamental move that recurs through every phase: generate candidates → eliminate via trade-off analysis. Scoring is the discipline that makes elimination defensible, and it runs continuously across the loop, not just at the end. Each phase below is one contraction of the funnel (model.md); here is what that contraction buys you and what breaks when it’s done badly.

Phase 1 — Mission: why it bounds everything

The first contraction. You cannot write good requirements until you know what work the system is for, so Phase 1 names the mission — its purpose, the use cases, and the system that actually exists — stated tool-neutrally, before any requirement is written. It bounds everything after it, which is exactly why a wrong move here is the most expensive: a mis-framed mission silently pre-selects an architecture three phases before the space is supposed to open.

The moves that earn the phase:

Stakeholder mapping. Who decides, who uses it daily, who is the sceptic. Lead-user comfort is a constraint, not a preference — naming the non-technical lead user here is what later lets approachability sit as a critical criterion rather than a nice-to-have.
The stress-test use case. One concrete, near-future, high-stakes scenario the system must serve, named now so every later evaluation can be walked through the same scenario. It is the spine of the Scenario (model.md) and the thread that keeps Phase 4 honest.
Use cases, tool-neutrally. State what the system must do (“track follow-ups with reminders”), never the product category (“a CRM”). The test: can you state it without naming a product category? If not, reframe it. Naming a tool type here pre-selects the architecture before Phase 3 opens the space — and that leak carries into the requirements at every tier. The canonical term is use case / Anwendungsfall, not “job/Aufgabe” (a label that leaked in from early cockpits); the cockpit ships them as mission.json useCases[].
Ground it in the real system. When prior documents are dropped in — specs, interview notes, an earlier AI-drafted requirements doc — treat them as untrusted, possibly contradictory, possibly stale. They drift, they’re incomplete, and AI-generated ones are often confidently wrong. Trust raw sources (transcripts, on-site notes) over synthesized ones, surface every contradiction rather than silently picking a version, and validate each material assumption with the user before it becomes a requirement. Build the picture from what’s confirmed true, marking the rest to-verify.

How it fails: stating the mission in tool categories (“we need a CRM and a ticketing system”), which hands Phase 3 its answer and collapses the space before it was ever examined; or inheriting an AI-drafted doc’s framing as ground truth, so the whole engagement re-derives a confident error.

Phase 2 — Requirements: why it must be self-sufficient

Phase 2 turns the mission into a requirements document well-scoped enough that a landscape-survey brief can be written from it with no further customer input. That self-sufficiency is the whole point: the requirements are the columns of a form (model.md), and the form has to be complete before research goes out to fill it. This is the step that earns the entire engagement — every downstream phase consumes it — so a gap here propagates everywhere.

The requirements sort into tiers — hard filters (binary pass/fail, any failure eliminates), critical preferences (decision-drivers), standard, nice-to-have — and each becomes a criterion placed on exactly one axis. Functional requirements, non-functional requirements, risks, and constraints all become criteria; the single-placement rule (one criterion, one axis) is decided here, the cheapest place to catch a downside masquerading as a capability. The tiers, the weights, and the requirements-vocabulary → axis mapping are owned by model.md and contract.md; this phase’s job is to apply them cleanly and to quote the customer’s stated priorities verbatim so they carry forward without drift.

Two moves matter beyond filling the tiers:

Name what is not a requirement. Saying “EU data residency is not required” out loud eliminates an entire false-disqualification axis before research ever runs.
Sketch the demand picture, don’t pin it. A rough Mengengerüst — prose plus target amounts, no schema, no named knobs — captures the shape of demand. The actual Scenario parameters get pinned in the Phase-4 pricing pass, once candidate solutions reveal which cost shape each needs. Committing to seats/peaks/toggles now imports a specific (usually SaaS-shaped) cost model before any solution exists to cost.

How it fails: mistaking a preference for a hard filter (or asserting a hard filter that was never actually a constraint for this user), which silently disqualifies good candidates; preferences that never differentiate (every tool scores 4–5) and so are dead weight — sharpen the rubric or remove them; a compound filter that bundles several requirements into one (“must be their whole web presence” hiding discovery + trust + booking), which lets a tool pass by clearing only one; and carrying parallel “non-functional” or “risks” lists alongside the criteria, which is exactly the bookkeeping single-placement replaces. The conversational instrument that elicits all of this — the recurring-requirements checklist, the holistic-then-criterion-by-criterion sign-off — is owned by requirements-interview.md.

Phase 3 — Architecture: why it precedes the market

Phase 3 partitions the solution space into a few coherent architectures and surfaces the forks (decision points) whose positions generate them — the structural findings that determine which architectures are viable before any market data is gathered. It runs in two movements: the systems analysis that yields the forks, then the candidate architectures drawn from them. (For a pure single-tool pick the forks are light or absent; the phase shrinks accordingly.) The fork-as-unit, the resolution modes, and the architecture-is-not-a-solution rule are owned by model.md; here is why each output is worth producing.

The coupling finding is the most valuable output. It tells you which entities must share relational state for the core workflows to function. If a workflow must traverse several related entities in one query, a solution that stores them across separate tools with async sync is structurally fragile — it breaks lookup, prefill, and triggers at the seams even if each individual tool is excellent. Naming this eliminates architecture classes, not just tools, before the landscape survey runs. Name which entities form the spine (must share one store) and which are loose.
Key workflows. The 2–4 highest-stakes flows through the domain, including the Phase-1 stress-test scenario, with the data traversed at each step and the human-in-the-loop / AI-insertion points. These double as the verification guide for the next client meeting: every unverified assumption is a question to ask.
The decision forks are the durable output. Name the degrees of freedom whose positions generate the architectures — who owns the system of record, whether a knowledge layer is native or separate, whether a core component is rented or owned. Each fork carries a resolution mode that names which later phase closes it (fact at 2, rating at 4, value at 5) and a leverage (how much it narrows). Forks are discovered throughout the engagement — Phase 4 surfaces more — but here is where you first go looking; place each later-found one at the phase that resolves it. Discovery ≠ placement (the rule is model.md’s; the consequence is this phase’s: architecture leads with the architectures and names the forks, but does not parade their interactive lever — the space must come before the lever that reshapes it).

The second movement enumerates the architectures — tool-agnostic forms, each a coherent path through the forks, drawn before any market scan. An architecture is not yet a solution: the loop (Phase 4) fills it with concrete tools + build + glue and scores it into a solution; a custom-built piece is a first-class component there, not a fallback. Include anchor candidates — architectures expected to fail, kept to verify a structural finding rather than to compete (the fragmented best-of-breed to test whether your coupling analysis really rules it out; the full custom build to mark the build ceiling). Mark them as anchors so they don’t distort the shortlist; the elimination itself is evidence. Enumerate the architectures fresh for this engagement, fold in the researcher’s hunch as one branch but generate the alternatives alongside it, and note which you expect to fall and why — how a composition fails is often the most useful structural finding.

How it fails: using the phase to relabel Phase 2 criteria as component names (“follow-up management” → “Task/Reminder Module”) — a 1:1 mapping of use cases to modules is circular, it adds no topology the criteria didn’t already have. If the output could have been produced by scanning the criteria linearly, the phase hasn’t done its job. The deeper failure is skipping the phase: done implicitly, an unexamined architectural assumption becomes the wall you hit on iteration three, and broadening turns into a panic rescue instead of a planned move. Label everything HYPOTHESIS until the client meeting verifies it, and keep structural findings falsifiable (“a fragmented spine will break relational lookup” is testable; “complex systems need tight integration” is not).

Phase 4 — Solutions: why the loop, and how it narrows

The core of the engagement — where architectures get filled with tools and scored into solutions (model.md: the composition is the scored unit), and where further forks surface (a cost or capability found here can open a decision Phase 3 couldn’t see). One iteration is landscape survey → (deep dive + red team) → clarification when needed; it exits with either a clear winner (→ Decisions) or no winner (→ broaden, → another iteration). Most engagements need only one iteration. Each stage is a separate turn because each is a research hand-off (the hand-off discipline and the brief skeletons are owned by research-briefs.md); the loop narrows a shortlist, it does not crown a winner early.

Narrowing is a palette of moves, not a fixed pipeline. Survey → deep dive → red team names the activities; it does not prescribe their order once the long list exists. After the survey you hold a palette of investigative moves — specify a candidate’s composition (its coverage), estimate its cost, estimate its build/integration effort, assess its risk, assess its fit — and at each step you run the one with the most leverage right now, the one that most reduces uncertainty about which solutions are real. Some candidates fall away by a failed gate, some by a consultant judgment call; there is no canonical order (sometimes one risk finding kills three candidates before any costing is worth doing). The mechanics of the moves and fork-discovery during narrowing are in model.md § Narrowing.

The loop is operational: the gaps are the work-list. Build the long list per architecture and initialise each candidate empty — all five facets offen, no scores trusted yet. Which move to run next is then not a matter of feel: it is the open facet whose closing would most separate the live candidates, read off the knowledge grid by information gain (model.md § The knowledge grid). So the iteration is concretely init empty → fill the highest-gain open cell → re-rank → collapse, repeated until the survivors differ only on a handful of forks (the collapse shape Phase 5 wants). A gap that turns out to hide a value-fork becomes a 🎯 client move, not a research one. The Arbeitsbrett (the per-architecture board) and the Schärfe move-rail that render this grid are cockpit.md; the infoGain compute stays template-level until it generalizes across engagements.

Bookend every research workflow with sign-off, and keep the raw. Before a dynamic research workflow runs, present its plan — its structure and how many agents it fans out to — for approval; when it returns, present the results for approval before any write-back (the bookend doctrine is model.md § The sign-off bookend; the brief hand-off and the ingest/audit procedure are research-briefs.md). Write-back may move scores, knowledge-states, cost, and findings but never gates or authored verdict prose, and a tool research found weaker becomes a red-team caveat + lower score, never a fabricated gate-0. Persist every round’s raw output verbatim to research/ the instant it returns (it often sits in ephemeral scratch that’s lost on recycle) — the structured matrix and the curated findings are distillations, not replacements. Any throwaway script written to wrangle a round is scratch: keep the raw research, discard the scripts.

The stages, and what each is for

Landscape survey. From a universe of 30–50+ candidates per architecture branch, produce a shortlist of 5–10 that pass the hard filters, plus a structural finding about the category, plus the explicit list of who was eliminated and on which filter. A branch’s core is a family of interchangeable products, not one tool — the shortlist is that family catalogued. No ranking, no winner, no preference scores — that’s the deep dive. In a solution-design engagement the survey’s real yield is architectural: the per-category finding feeds back to the Phase-3 forks and architectures, confirming, weakening, killing — or revealing a position the Phase-3 set missed (a market gap is itself a finding). The architecture set is provisional both ways: a survey can demote a branch and surface a new one. A position the survey adds or promotes earns a first-class survey at the same depth — never carried forward on the incidental mentions from the brief that happened to find it. And spend survey depth in proportion to a branch’s current live-ness, not its round-1 prominence: the first pass is a bet, so keep it cheap enough to re-bet, then rebalance depth toward the branches that survived. Closing the architecture funnel is the point; the narrowed tool list is a by-product. (Ehimare: R1 spent its deepest survey on the broker-CRM branch R2 then demoted, while the generic-CRM position that became the favourite was never enumerated in Phase 3 and stayed under-surveyed for two rounds.)
Deep dive. Head-to-head feature evaluation of the shortlist against the stress-test use case, walking every candidate through the same scenarios. It populates the Fit axis and picks the paper-winner.
Red team — run in parallel with the deep dive, not after. Framed as “what does feature-comparison miss?”: recent negative reviews (last 12–18 months), acquisition or leadership turmoil, pricing traps (geographic exclusions, tier paywalls hiding required features), hidden disqualifiers, stability signals, real user pain in forums. Most of this lands on the Risk axis. The deep dive picks the paper-winner; the red team determines whether the paper-winner survives contact with reality — and the decisive findings often come from here, not the feature comparison.
Roll tools up to solutions. Tool scores are inputs; the unit you rank is the solution (composition). After the deep dive, assess each candidate solution as a whole — its Fit (does the composition cover the use cases, including the custom piece?), its Risk (integration fragility and vendor-count/dependency risk grow with every tool and every line of glue), its Cost. A two-tool stack can out-fit any single platform and still lose on integration risk; that trade-off only shows up at the solution level, so score it there.
Pricing / TCO runs as its own pass after scoring. You need costs before you can recommend, but cost must never contaminate the Fit score — so pricing runs after Fit is fixed, as a separate axis. The 4-bucket method and its reasoning are owned by pricing-tco.md; the schema fields by contract.md. The protected rule lives here too: Fit is scored with no knowledge of price.
Clarification — only when there’s no clear winner (two or three finalists within ~5% on the matrix, and the 1–5 scale can’t separate them). Re-score all the tied finalists symmetrically on the full rubric — including any criterion discovered since Phase 2, back-scored across every candidate — rather than building the case for the current leader. Run the head-to-head on the gap (one persona, one workflow), but let the symmetric re-score, not a quiet qualitative tiebreak, move the order. Some of what separates finalists is genuinely experiential (which UI feels right) and belongs to the user trying them, not to your matrix; surface those rather than scoring them. Never re-litigate settled questions.

Eliminations are recorded, never deleted. Every eliminated candidate stays in the record with the reason it fell — in the matrix as a hard-filter 0, or in the loop log. Eliminations get revisited when criteria shift, so a struck-through candidate is data, not absence — it’s the first place a later round looks. This governs the research output too: when a landscape report hands you a ranked list or a named winner, discard the report ranking and treat the shortlist as unranked until the deep dive scores it (the report’s job is to find and filter, not to decide; the ingest/audit procedure that enforces this is owned by research-briefs.md).

How it fails: doing several stages in one breath (survey + score + recommend), which produces confident mush — the failure mode the whole skill exists to prevent; anointing a #1 in the first pass instead of leaving a 3–5 candidate shortlist; letting the research report’s ranking stand; or scoring tools instead of solutions, so the integration/vendor- count trade-off never surfaces.

Stamp what moved, flag what now rests on it (the judgement hierarchy)

A Phase-4 finding doesn’t just update a score — it can move a standing judgement one or two layers up (model.md § The judgement hierarchy). So when a finding lands and materially changes a child — a gate flip, an elimination, a score crossing a tier boundary, or a new finding tagged to a dependency:

Stamp the child’s revisedIn with the current round — on the touched solution (solutions*.json) and on the finding (rounds[].findings[].revisedIn, research-index.json). Cosmetic edits don’t get stamped (or the badge cries wolf).
Any architecture or comparison verdict whose verdict.basis[] names that child now reads stale — the cockpit shows “⚠ neu zu prüfen — <dep> hat sich in Runde N geändert” (kernel staleness() detects it; nothing is rewritten).
Re-examine and re-stamp deliberately. A human/agent re-forms the standing verdict and bumps its verdict.asOf to the current round (clearing the badge), or confirms it still holds and re-stamps anyway. The badge is a prompt to look, never an auto-edit.

This is the in-phase, append-not-rewind half of regression; the distinct escalation half (a finding that invalidates a requirement) is below.

Phase 5 — Decisions: why 2–3 solutions, not one

Collapse the accumulated research into a recommendation the customer can act on, and close the open forks: the 🎯 value-forks return to the client here (📊 rating-forks were settled by Phase 4’s scoring, 🔍 fact-forks by a Phase-2 datum). Don’t present one answer — present 2–3 solutions that lead to different outcomes (typically a recommended path, a compromise path, and a minimal-change path), each with its composition, an architecture sketch, the 4-bucket TCO band (not a single sticker price — pricing-tco.md), implementation path, what it optimises for, what it trades off. Present the comparison across the four axes — Fit and Risk scored, Cost as a band whose width reads as cost risk — never folded into one blended number.

The shape to aim for is 2–3 solutions + 2–3 open decisions, reached by collapsing the survivors once they differ only on a handful of forks (their fork positions, model.md): siblings that vary on a single decision become “one solution family + that open decision,” which is exactly the briefing shape. If too many axes still separate them to present cleanly, that’s the signal to run another move or another round — not to force the pick.

The instrument informs; the consultant authors the verdict (P1 below). The weighted ranking and its named weighting presets are a transparent, turnable instrument — show the client your weighting and let them watch the order move — but the verdict is yours to write, framed against the customer’s stated values (the verbatim Phase-2 quotes), risks named not glossed, and free to diverge from whatever currently sits on top; when it does, say so and say why. The old rule “never blend into one score” survives in its true form: never let an opaque blend stand in for your judgment.

How it fails: presenting one answer as if research dictated it (it informs, it doesn’t dictate); folding the four axes into a single ranked number that hides the trade-off; forcing a pick when the survivors still differ on too many axes; or letting the instrument’s top row be the verdict instead of authoring one.

Regression — absorbing change without rewinding

The funnel is re-entrant. Priorities shift, a new requirement surfaces, a stakeholder reframes the mission — and the method must take it in without discarding what it learned. The discipline:

Append, don’t rewind. A new investigation is a new research round, logged in research-index.json rounds[] (what it asked, what it found, how it moved the picture). The log only grows; you never rewrite an old round.
Update the judgments in place. A new criterion is added to criteria.json and back-scored across every surviving candidate — an unscored criterion that never touches the totals is a rescore that changes nothing. A shifted priority is a re-weight. Eliminations are recorded as a hard-filter 0, never deleted.
The live surfaces show the latest; the Report is the history. Every live surface renders current truth — the Reise stations the current mission/criteria/architecture, the Exploration screens (Arbeitsbrett, Verdikt, Cockpit) the current scores, shortlist, and fork positions. The Report screen is the phase-by-phase log of how the engagement learned — history lives in the Report, current truth lives in the live surfaces. (The rounds[] data history and the optional loop-log.md prose nav aid are distinct granularities — contract.md.)
Bubble the change up the judgement hierarchy. A regression that moves a child stamps its revisedIn, which flags every standing architecture/comparison verdict resting on it (basis[]) as neu zu prüfen — the soft, in-phase signal that a higher-layer judgement may no longer hold (the stamp-and-flag protocol is in § Phase 4 above; the detector is kernel staleness()). This is detection, not re-derivation: the human re-examines and re-stamps verdict.asOf. Footgun: if you rename a solution key or finding id, fix every basis[] that referenced it — an orphaned ref silently stops matching and the verdict looks fresh forever (contract/validate.mjs warns).
A research-initiated requirement change is escalated, not silently absorbed. Re-deriving forward applies when a later-phase finding moves a fork, a rating, or a candidate. But when a finding invalidates or materially changes an earlier-phase requirement (Phase 1 use case or Phase 2 criterion) — e.g. a survey shows a stated ambition is unbuildable or carries unacceptable liability — STOP and escalate to the consultant/user before re-deriving. A requirement change is theirs to confirm, exactly like a value-fork; the agent may propose it (with the finding that forced it) but never quietly rewrite the requirement and recompute. (Ehimare R2: a survey finding that customer-facing AVB-citing carries hallucination liability changed the requirement; it was reframed as an internal-only helper — a requirement move that needed sign-off, not an absorb.)

So a regression doesn’t reset the engagement to an earlier phase; it adds material at whatever phase the change touches (a new requirement re-enters Phase 2, a new architecture re-enters Phase 3) and re-derives forward. Because the phases are data files, not code branches, re-entering one is editing that file and letting the rest recompute.

Broadening is the within-Phase-4 special case of regression. When an iteration exits without a winner — no architecture produced an acceptable candidate, or every finalist dies on the same dimension on the same axis — the architectural assumption is wrong, not the candidates. “Stuck is information.” Broaden along three streams: gap-check (did a hard filter over-eliminate?), adjacent category (a market not surveyed?), alternate architecture (invert the assumption). Propose the direction, stop for sign-off, then start the next iteration with a fresh landscape survey on the broadened scope. Broadening is triggered by a dead loop and stays inside Phase 4; regression is the general case, triggered by changed inputs at any phase. Same mechanic (append, re-derive), different trigger.

The five Principles

Architecture before tools. The first decision is the shape of the stack, not the products. Phase 3 exists to make that decision before the market scan can bias it.
Generate then eliminate. Every iteration of the loop produces candidates and eliminates via trade-off analysis. Scoring is what makes the elimination defensible.
Stuck is information. When the loop hits a wall, the architectural assumption is wrong — broaden, don’t grind on the same path.
Stakeholder comfort is a decisive criterion. Tool approachability for the actual lead users is weighted alongside power-user features, not below them.
Scoring drift is signal. When the quantitative ranking and the qualitative read diverge — the scores rank A first but the verdict prose names B — investigate the underlying assumptions before reconciling. Don’t silently smooth it over.

The phase-gating discipline that holds across all five

After every phase: STOP, present what you produced, state the decision the human needs to make, and wait for sign-off before the next. This is not a single-shot task — it is a staged engagement across many turns and usually several chats, and the single most common failure mode is an agent doing several phases at once. The stop is the discipline that makes the output trustworthy; if you ever notice yourself about to “also go ahead and…,” that’s the failure mode, and you stop there. (The same rule applied inside a step is P1’s verdict doctrine, owned by model.md: the consultant authors the verdict, the agent presents the picture and the defects it found, and never eliminates a candidate or names a winner on its own.)

← Alle Methodik-Themen