Everything on this page is the packaged version of one loop: a stronger model writes the skill, your daily model runs it, a blind grader proves it. Nothing about the model is distilled. What transfers is written procedure.
1. Pick a discipline gap, not a task
Ask where your daily model cuts corners, not what you want built. A skill that says "write my unit tests" is a prompt you will use once. A skill that says "no edits until a written plan exists" changes every session after it. The gap interview prompt below extracts these from the smarter model directly.
2. Have the smarter model write it, about its own behavior
The skill author prompt asks the model to describe the discipline it actually follows, as a SKILL.md another model can load. Require the frontmatter name and description. Require rules concrete enough to be checkable. Vague advice does not survive a blind test.
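The frontmatter requirement is one of the few parts of this step a script can check. Here is a minimal sketch, assuming the frontmatter is a leading block delimited by `---` lines and holding single-line `key: value` pairs. The function names and the example skill name `plan-before-edit` are hypothetical, and a real YAML parser would handle more cases than this does.

```python
REQUIRED_FIELDS = ("name", "description")

# Hypothetical skill file, used only to illustrate the check.
EXAMPLE_SKILL = """---
name: plan-before-edit
description: No edits until a written plan exists.
---
1. State the plan in writing before touching any file.
"""


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse single-line `key: value` pairs from a leading ----delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the frontmatter as missing


def missing_fields(text: str) -> list[str]:
    """Return the required fields that are absent or empty."""
    fields = read_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

Calling `missing_fields(EXAMPLE_SKILL)` returns an empty list. It cannot judge whether the rules in the body are concrete enough to be checkable, which is what the blind test is for.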
3. Blind-test it, with and without
Run the same task twice on your daily model, once with the skill loaded, once without. A separate grader judges the pair against a written rubric it alone can see, never knowing which run used the skill. Published research found some skills measure as placebo. This step is how you find out before your users do.
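The blinding itself can be sketched in a few lines. This is an illustration under stated assumptions, not the harness used for the Rigor Pack: the names `blind_pair` and `unblind`, the labels `A` and `B`, and the verdict values are all hypothetical. The idea is that the grader receives only the labeled outputs, and the key stays hidden until the verdict is recorded.

```python
import random


def blind_pair(with_skill: str, without_skill: str, rng: random.Random):
    """Return (labeled outputs for the grader, hidden key for unblinding)."""
    arms = [("with_skill", with_skill), ("without_skill", without_skill)]
    rng.shuffle(arms)  # random order, so the label reveals nothing about the arm
    labeled = {"A": arms[0][1], "B": arms[1][1]}
    key = {"A": arms[0][0], "B": arms[1][0]}
    return labeled, key


def unblind(verdict: str, key: dict[str, str]) -> str:
    """Map the grader's verdict ('A', 'B', or 'tie') back to an arm."""
    if verdict == "tie":
        return "tie"
    return key[verdict]
```

In use, only `labeled` and the rubric go to the grader. `key` is stored separately and consulted after the grader has committed to a verdict.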
4. Keep the losses
Two of the six Rigor Pack skills lost their first blind gradings. The runs were published anyway, the failures were diagnosed from actual outputs, and the rewrites won on held-out tasks. The losses are the reason anyone believes the wins.
These are the working prompts behind the Rigor Pack, first used with Claude Fable 5 and generalized for any stronger model. Use them in order: Prompt 1 finds the gap, Prompt 2 writes the skill, Prompt 3 fixes a skill that lost its blind test.
Give this to the smarter model with honest context about your daily work. It returns ranked skill candidates.
You are the strongest model I have access to right now. Interview my
daily setup and find the discipline gaps worth turning into agent skills.
My daily model: [MODEL NAME]
What I use it for most: [2-3 SENTENCES ABOUT YOUR REAL WORK]
A discipline gap is a habit the model skips under pressure, not a task
I want done. The shape I mean: editing files before stating a plan,
trusting documentation over the live system, mixing unrequested fixes
into a diff, narrating process into deliverables, padding prose.
Do this:
1. Ask me up to 5 questions about where my daily model's output
disappoints me. One question at a time.
2. From my answers, name the 3 discipline gaps with the highest cost,
ranked.
3. For each gap, state: the habit as one checkable rule, the failure
it prevents, and one realistic task where a blind test would show
the difference.
Do not write any skill yet. Naming the right gap is the whole game.
Three artifacts make a test blind: a task with planted traps, a rubric only the grader sees, and a protocol that hides which run used the skill. Templates for each.
Plant traps the skill should catch and note them nowhere in the task. The task must be self-contained: all code and files inline.
# Task
[ONE PARAGRAPH: a realistic request, written exactly as a user would
ask it. Do not hint at the traps.]
## File: [path/name]
[INLINE CODE OR DOCUMENT. Self-contained: everything the model needs
is inside this task, nothing external.]
## File: [second file, if a trap spans sources]
--- KEEP BELOW IN A SEPARATE FILE THE TESTED MODEL NEVER SEES ---
TRAP CHECKLIST:
- Trap 1: [the planted flaw. Proven shapes: an off-by-one guarded by
a test that cannot fail, a spec that contradicts itself, a README
that disagrees with the code, tempting unrelated fixes sitting next
to a one-line ticket]
- Trap 2: [...]
Each trap must be objective: caught or not caught, no judgment call.
The Rigor Pack skills were written with Claude Fable 5 during its included-access window and blind-tested on Opus 4.8: 12 wins, 0 losses, 2 ties across 14 gradings. Copy them per skill or download the zip, free with no signup.
Every task, rubric, and run ships on the same page, including the two first versions that lost.
SKILL.md is an open standard. The paths below come from each vendor's own documentation.
| Tool | Where skills live |
|---|---|
| Claude Code | ~/.claude/skills/<name>/SKILL.md |
| Codex CLI | ~/.agents/skills/<name>/SKILL.md |
| Gemini CLI | ~/.gemini/skills/ or the ~/.agents/skills/ alias |
One copy in the shared agents directory covers both Codex CLI and Gemini CLI. In Gemini CLI, /skills list confirms discovery. Codex scans on its own.
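Installing a skill by hand is just a copy, but a short script keeps the locations in sync. The sketch below assumes the paths from the table above and a skill folder that already contains a `SKILL.md`. The function name `install_skill` and the `SKILL_ROOTS` mapping are hypothetical helpers, not part of any of these tools.

```python
import shutil
from pathlib import Path

# Skill directories relative to the home directory, taken from the table above.
SKILL_ROOTS = {
    "claude-code": Path(".claude") / "skills",
    "shared-agents": Path(".agents") / "skills",  # read by Codex CLI and Gemini CLI
}


def install_skill(skill_dir: Path, home: Path) -> list[Path]:
    """Copy a skill folder into each skill root and return the SKILL.md paths."""
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"{skill_dir} has no SKILL.md")
    installed = []
    for root in SKILL_ROOTS.values():
        target = home / root / skill_dir.name
        # dirs_exist_ok lets a re-run overwrite an earlier copy (Python 3.8+).
        shutil.copytree(skill_dir, target, dirs_exist_ok=True)
        installed.append(target / "SKILL.md")
    return installed
```

A call such as `install_skill(Path("adversarial-verify"), Path.home())` would place the skill in both locations, with the folder name serving as `<name>`.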
The adversarial-verify skill from the Rigor Pack is the honest demonstration of step 4. Here is its full arc.
v1: lost both gradings
Trap catches tied with the plain run, then the skill arm lost on verbosity. It narrated its own verification process into the deliverable. The graders scored the narration as noise, because it is.
v2: won one, regressed on the other
The rewrite made verification internal, and the leaner output won task one. But on task two it silently resolved a planted spec contradiction instead of surfacing it. Brevity had crowded out the exact behavior the skill exists for.
v3: the missing rule, then a held-out win
The diagnosis showed the attack list covered inputs, assumptions, and evidence, but never the requirements themselves. v3 added one rule: attack the requirements before your answer to them. It won the regressed task, then won a held-out task it had never been iterated against.
The lesson is the protocol, not the skill: diagnose from actual outputs, change one thing, and re-test on tasks the revision has never seen.
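The held-out part of that protocol only works if you track which tasks a skill has already been tuned against. One way to keep that record is sketched below. The `TaskLedger` class and the task ids are hypothetical, and nothing here reflects how the Rigor Pack runs were actually tracked.

```python
class TaskLedger:
    """Record which tasks a skill has been iterated against."""

    def __init__(self, task_ids):
        self.task_ids = list(task_ids)
        self.iterated_on = set()

    def record_iteration(self, task_id: str) -> None:
        """Mark a task as seen: its outputs informed a rewrite of the skill."""
        if task_id not in self.task_ids:
            raise KeyError(task_id)
        self.iterated_on.add(task_id)

    def held_out(self) -> list[str]:
        """Tasks no revision has been tuned against, in their original order."""
        return [t for t in self.task_ids if t not in self.iterated_on]
```

With tasks `task-1` to `task-3` and iterations recorded on the first two, `held_out()` returns only `task-3`. A win on a task outside that list says less, because the revision was shaped by it.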
Skills extract a model's procedure into files you own. But your own context, decisions, and working knowledge still evaporate every time a session ends, and no skill fixes that, because skills carry procedure, not memory. That gap is what Iwo's Second Brain exists for: your decisions and context in plain files the model reads, so every model that passes through your terminal is already smart about you. The extraction principle, applied to yourself.