Methodik

Research briefs — the Phase-4 hand-off

references/research-briefs.md — direkt aus der Skill-Doktrin gerendert.

Phase 4 runs as a loop, and every stage of that loop is driven by a brief. Brief quality drives research quality. This file owns the hand-off end to end: how you write the brief, the stage templates you fill, how the human runs it, and how you ingest and audit what comes back. For why Phase 4 exists and how it fails (the loop, the deep-dive + red-team pair, regression, broadening), see process.md; for the funnel and the four axes those briefs serve, see model.md.

The research workflow

Run each brief as its own research task — one brief = one researcher, fanned out in parallel for breadth (seed-all, prune-on-gate). Match the mechanism to the goal: breadth (a 30–50-candidate landscape, a multi-capability filler sweep) → parallel narrow tasks; depth on one contested claim (a red-team dossier, a finalist clarification) → a single focused run. However it’s run, each brief must be self-contained — it runs cold, with zero context from this session — so embed the org, the scenario, the requirements verbatim; never link to something the researcher can’t see.

Get the workflow plan signed off before you run it, and run in deliberate waves. Present the shape of the fan-out — which briefs, how many researchers — for the consultant’s approval before launching; that is the front half of the sign-off bookend, the results-approval-before- write-back being the back half (model.md § The sign-off bookend, process.md § Phase 4). Prefer a few reviewable waves over one uncontrolled blast: a wave you can read, correct, and re-aim beats a fan-out so wide its output can’t be audited. This is a principle of control, not a hard cap on agent count — size each wave to what you can actually review before launching the next.

Save every brief to briefs/, named by stage and iteration (e.g. briefs/01-landscape-crm.md). When a round returns, persist its raw output verbatim to research/ immediately — research output often sits in ephemeral scratch that’s lost on session/container recycle, so capture the raw before it evaporates. Keep three fidelities, all of them: the raw report verbatim (research/), the structured matrix-*.json, and the curated findings (research-index.json / synthesis) — never let the distillation be the only copy. The brief trail and the saved raw are the audit record of what was asked and what came back; ingest as below.

Briefs research tools, not solutions

The market gives you tools — building blocks; the briefs below survey and score those. The solution layer — which tools you compose, what you build, how much the integration costs — is your design and estimation work, not a research hand-off. Size build and integration effort as work-package bands yourself (the method is in pricing-tco.md, the field shapes in contract.md), informed by the tool research but never outsourced to it. A research tool can tell you what a tool does and costs; it cannot tell you which composition wins under your Scenario — that is the consultant’s judgment (model.md § verdict doctrine).

The brief skeleton

Every brief, whatever the stage, follows this eight-part shape:

Why this research exists — one paragraph of purpose.
Context — embedded, not linked. The org, the stress-test scenario, the relevant requirements. Enough that the research tool needs nothing else.
Your task — verb-led, scope-bounded. What to produce, how many candidates, what shape the answer takes.
What is settled (do not redo) — additive-only framing. Name prior artifacts concretely (“the CRM red-team report,” not “prior research”).
What to investigate — listed and prioritised.
Hard filters / scoring rubric — explicit, with the tier weights and what separates a 3 from a 5.
Caveats on prior data — what to verify, what to trust.
Output format — specified and named, matching the matrix schema (contract.md § matrix-<category>.json) so ingestion is mechanical.

Sharpening rules that keep briefs from drifting:

Quote the customer’s stated values verbatim — same words every time.
Each brief runs in a fresh chat with no project context required.
Late-loop briefs (clarification) must never re-litigate settled questions. If you catch yourself writing “evaluate X,” check whether you mean “clarify which aspect of X.”
Distrust umbrella category labels. Before surveying, check whether the category name hides a gradient that could change the verdict (“low-code” spanning no-code config → light scripting → developer-orchestration). If it does, name the sub-taxonomy up front and make the brief bucket candidates onto it — an unsegmented survey averages away the distinction that decides. (Ehimare R3: “low-code” split into a No-Code→Low-Code→Orchestration ladder; the reminder-without-scripting rung was the real divider.)
Findings are perishable — date them and record the flip-condition. Every elimination and standout carries the date it was checked and the one fact that would reverse it (“out on no open API — revisit if they ship one”), and is re-validated, not assumed, next round. Landscape facts have a short half-life: vendors sunset APIs, OSS ships the missing feature, pool rules change. (Ehimare R3: Tyntec’s on-prem edge had been killed by a Meta API sunset; an earlier “Baserow is out” was overturned by a release.)

Stage templates

Fill the brackets from the engagement. Keep the output-format section aligned with the matrix schema in contract.md so returned reports drop straight into the matrix.

Landscape survey

Structure the survey per architecture/position, not per capability layer (C10). Run one landscape brief per Phase-3 shape/position (the [architecture branch] in the title) — survey the market that could be that whole architecture — plus one cross-cutting brief for the shared fillers/Bausteine that any position may dock (the AI-assist bricks, the channel, the glue; skeleton below). Splitting instead by capability layer (“CRMs” vs “AI knowledge” vs “channels”) fragments each architecture across briefs and hides the cross-position findings — e.g. whether the platforms that can be licensed are the same ones that can dock the fillers. (Ehimare R1 ran by capability layer and missed the dock-paradox; the R2 by-position resurvey surfaced it.)

# Landscape survey — [category], [architecture branch]

## Why this research exists
We are selecting [category] for [org, one line]. This is the broad scan: from the
full market, find the candidates worth a closer look and tell us what shape the
market has.

## Context
[Org profile. The stress-test scenario. The hard filters verbatim. The stated
priorities verbatim.]

## Your task
Survey the [category] market broadly — aim to consider 30–50+ products. Apply the
hard filters below. Return a shortlist of 5–10 that pass, each with a
one-paragraph profile. Also return: (a) the explicit list of who you eliminated
and on which hard filter, and (b) one structural finding about this market — the
underlying tension or gap that shapes the whole category.

## What this brief is NOT — do not do any of these
- Do NOT rank or score the shortlist, or call any candidate "best", "the fit",
  "recommended", "the lead", or a "winner".
- Do NOT recommend a pilot, a decision, or a next step.
- Do NOT score any preference / weighted criterion — that is a separate later sprint.
- Do NOT narrow toward a front-runner. Every tool that clears the hard filters is an
  equal member of the shortlist at this stage.

Surfacing a winner now is the single most common way this sprint goes wrong, because
LLM research is trained to conclude. Resist it — this pass only sorts who
clears the gate.

## Hard filters — judge each one LITERALLY (binary; one failure eliminates)
Below is the exact text of each hard filter. Use it verbatim; do not paraphrase or
tighten it.
[SH-H1 …]: [full criterion text, including any "or" branches]
[…]

How to judge them:
- A candidate PASSES if it meets the written condition, **including any "or" branch**
  (e.g. "EU/EEA **or** SCC + TIA"). Never raise the bar above the text.
- Degree and quality differences — how good it is, whether it's the default, which
  pricing tier it needs, how transparent the vendor is — are NOT grounds to fail a
  binary filter. Note them as evidence for the later scored criteria, but PASS the gate.
- Apply the identical standard to every candidate. If two candidates have the same
  facts they MUST get the same verdict; same-facts-different-verdict is a bug.
- There is no "conditional pass". A candidate that meets the bar with a caveat is a
  PASS plus a noted weakness for scoring — not a fail and not a maybe.

Worked example: a filter reads "provides an Art. 28 AVV; EU/EEA processing OR SCC + TIA
for a third country." A US-headquartered vendor that offers an AVV and covers US
transfers via SCC/DPF **passes** — its non-default EU residency is a depth demerit
scored later, not an elimination. Failing it for "not EU-hosted by default" enforces a
stricter rule than the text, and is wrong.

## What to investigate
[Prioritised list tied to the critical preferences.]

## Output format
List the shortlist in a neutral order (alphabetical is fine) — **not ranked**. For each
shortlisted tool: name, one-paragraph profile, and pass/fail against each hard filter
with a one-line reason. Then the eliminated list (tool → the single failed filter +
one line of evidence). Then the structural finding (a short paragraph). No summary
recommendation, no "top pick".

Cross-cutting fillers survey (run alongside the per-position landscape briefs)

The companion to the per-position briefs (C10): one brief for the shared Bausteine any architecture may dock — so each position’s brief can stay focused on the spine, and the cross-position question (which spines can actually carry which filler) is asked once, explicitly.

# Filler survey — [filler/Baustein, e.g. "AVB knowledge layer" | WhatsApp channel | doc pre-fill]

## Why this research exists
Across the candidate architectures, every position may need [filler]. This brief surveys how
[filler] can be provided — bought, rented, or built — and, critically, **what it needs from the
spine to dock** (open API? webhook? data export?), so we know which architectures can actually
carry it.

## Context
[Org profile. The stress-test scenario. The hard filters verbatim that apply to this filler —
e.g. data residency, liability. The use case(s) it serves verbatim.]

## Your task
Survey the ways to provide [filler] across **four delivery modes — SaaS (rent) · self-hosted
commercial (buy) · open-source (self-host) · self-built (build)** — and treat **open-source/self-host
as its own mandated seed bucket** (results skew to SaaS, so an un-mandated survey silently drops the
OSS half, often the sovereignty/independence answer). For each, return: a one-paragraph
profile, pass/fail against the hard filters, and — the point of this brief — **its docking
requirement** (what interface/openness it needs from a host spine). Also return one structural
finding: the cross-cutting tension (e.g. "the providers that can be licensed standalone are the
ones that DON'T expose a dockable API").

## What this brief is NOT — do not do any of these
- Do NOT rank/score or name a "best" filler. Do NOT pick which architecture should win.
- Do NOT score weighted preferences — that is a later sprint.

## Output format
For each option: name, one-paragraph profile, pass/fail per hard filter, and its docking
requirement. Then the eliminated list (option → failed filter + evidence). Then the structural
finding (a short paragraph) — especially any spine↔filler compatibility constraint.

Deep dive (run in parallel with the red team)

# Deep dive — [category] shortlist

## Why this research exists
The shortlist is set. Now score these [N] tools head-to-head against what actually
drives our decision, walking each through our real scenario.

## Context + what is settled
Shortlist: [tools]. These already pass hard filters — do NOT re-litigate filters.
Stress-test scenario every tool must be walked through: [scenario, verbatim].

## Your task
For each tool, score 1–5 on every preference criterion below and give 2–4 sentences
of reasoning per score citing specific evidence. Walk each tool through the
stress-test scenario explicitly.

## Scoring rubric (1–5; state what separates a 3 from a 5)
Critical (×5): [CR-C1 …] — [rubric]
Standard (×3): [...]
Nice-to-have (×1): [...]

## Pricing (structured — for the TCO model, not a sticker price)
For each tool report its **pricing model and parameters**, not just a number:
- the charging model (per-seat / per-resolution / per-conversation / AI-credits /
  tiered-with-caps / self-host);
- the per-unit rates and the plan **tier that unlocks the AI features**;
- caps / included allotments before overage, and any language/usage caps;
- one-time onboarding/integration costs;
- whether each rate is **verified** (from the vendor's pricing page, link it) or
  **estimated**.
This is what a TCO model consumes downstream; a single monthly figure is not enough.
(The TCO method is in `pricing-tco.md`; the `pricing` field shape in `contract.md`.)

## Output format
A scored matrix: tool × criterion, each cell = score + reasoning sentence. Plus,
per tool: a one-line verdict, top strengths, top weaknesses, and the structured
pricing block above.

Red team (run in parallel with the deep dive)

# Red team — [category] shortlist

## Why this research exists
Feature comparison picks a paper-winner. This brief decides whether the
paper-winner survives contact with reality. Be adversarial.

## Context + what is settled
Shortlist: [tools]. We have a feature deep-dive running separately — do not repeat
it. Hunt for what feature-comparison misses.

## Your task
For each tool, investigate and report:
- Negative reviews in the last 12–18 months (G2, Capterra, Trustpilot, Reddit,
  forums) — quote real user pain.
- Acquisition, funding, or leadership turmoil; product-sunset signals.
- Pricing-structure traps: geographic exclusions, tier paywalls hiding required
  features, surprise onboarding fees, seat-count cliffs.
- Hidden disqualifiers against our hard filters that a sales page would hide.
- Stability / reliability signals (outages, status-page history).

## Output format
Per tool: a risk verdict (does it survive?), the specific findings with sources
and dates, and any criterion scores that should be revised downward as a result.

Clarification (only when no clear winner)

# Clarification — [finalist A] vs [finalist B] on [the gap]

## Why this research exists
[A] and [B] are within ~5% on the matrix and the 1–5 scale can't separate them.
The deciding question is [the specific gap], not a re-evaluation.

## Context + what is settled
Everything except [the gap] is settled — do not re-score the rest. The single
persona and single workflow to test: [persona] doing [workflow].

## Your task
Side-by-side walk-through of [A] and [B] for [persona] doing [workflow]. Where
exactly does each help or get in the way? Resolve the gap.

## Output format
A direct comparison on the gap, a recommendation between the two, and which
criterion scores (if any) should move.

Broadening (only after an iteration finds no winner)

Broadening doesn’t have a single brief — it produces the scope for the next iteration’s landscape survey. Broadening is the within-Phase-4 special case of regression (see process.md): first propose the broadening direction to the human (gap-check / adjacent category / alternate architecture) and get sign-off, then write a fresh landscape-survey brief on the broadened scope.

Refinement requests

After ingesting results you’ll often need a small follow-up — re-verify a price, chase one finalist’s API limits, confirm a sunset rumour. These are mini-briefs: same skeleton, tightly scoped to the one open question, with “what is settled” doing the heavy lifting so the research tool doesn’t redo work. Keep them in briefs/ too — the trail of what was asked is part of the audit.

Ingesting a returned research report

When a research round returns, this is the single procedure for turning it into matrix data. Do not skip the audit step.

First, audit the report — it is biased input, not ground truth (the same stance the skill takes toward user-dropped documents, extended to research output). LLM research reliably:

volunteer a ranking / winner / pilot recommendation even when the brief asked only for a shortlist — discard the ranking; the report’s offered winner is thrown away, and in a landscape pass the preference scores stay empty. This is the same anti-ranking discipline the landscape brief’s NOT-list enforces on the way out; enforce it again on the way in.
over-apply hard filters, importing a stricter rule than the criterion text — re-check each elimination against the literal filter text, including any “or” branch, and re-qualify anything failed on an unstated rule;
judge inconsistently — scan for same-facts-different-verdict pairs and reconcile them to the criterion text;
parrot vendor marketing — treat claims as claims and flag what still needs verifying.

Correct these before you ingest, and tell the user which verdicts you overrode and why. Eliminations you uphold are recorded as a hard-filter 0, never as row-deletion (see the scoring discipline in contract.md). Only then:

Read it against the criteria IDs. For each tool × criterion, extract the score and a tight reasoning sentence.
Write/merge into the relevant matrix-<category>.json (schema in contract.md). Don’t overwrite a human-verified field with a fresh research guess without flagging it.
Fill qualitative for every tool the report covers — including eliminated ones.
Append a round to the research log — what this round asked, what it found, where it leads (the append-only rounds[] in research-index.json, and/or the optional loop-log.md prose nav aid; their distinct roles are documented in contract.md).
Reload the explorer and report what moved — new eliminations, rank changes, rescores — rather than dumping the whole matrix. Then stop for the next call.

Always validate after writing. A truncated or inconsistent file breaks the cockpit silently, so don’t eyeball it — run the system validator and fix any errors before you report back or move on:

node contract/validate.mjs <engagement>/data

It verifies every JSON file parses, criteria have valid tiers, scores reference real criterion IDs, hard filters are 0/1, preferences are 1–5, and flags scored cells with no reasoning note. Non-zero exit = fix it first.

← Alle Methodik-Themen