The Skill Distillation Kit

Everything on this page is the packaged version of one loop: a stronger model writes the skill, your daily model runs it, a blind grader proves it. Nothing about the model is distilled. What transfers is written procedure.

Quick start

1. Pick one discipline gap where your daily model cuts corners.
2. Open Prompt 1 below with the smartest model you can reach today.
3. Have it write the skill with Prompt 2, then blind-test with the rig.
4. Keep what wins, publish what loses, revise with Prompt 3.

The Method

The four steps, one paragraph each

1. Pick a discipline gap, not a task

Ask where your daily model cuts corners, not what you want built. A skill that says "write my unit tests" is a prompt you will use once. A skill that says "no edits until a written plan exists" changes every session after it. The gap interview prompt below extracts these from the smarter model directly.

2. Have the smarter model write it, about its own behavior

The skill author prompt asks the model to describe the discipline it actually follows, as a SKILL.md another model can load. Require the frontmatter fields `name` and `description`. Require rules concrete enough to be checkable. Vague advice does not survive a blind test.
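
The frontmatter requirement is itself checkable. Below is a minimal sketch, assuming the header is a `---`-delimited block of simple `key: value` lines at the top of SKILL.md. The function name `check_frontmatter`, the parsing rules, and the example skill `plan-before-edit` are illustrative assumptions, not part of any tool.

```python
def check_frontmatter(skill_text: str) -> list[str]:
    """Return a list of problems; an empty list means the header passes."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing opening '---' line"]
    try:
        # Index of the closing delimiter, searching after the first line.
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return ["missing closing '---' line"]

    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()

    problems = []
    for required in ("name", "description"):
        if not fields.get(required):
            problems.append(f"missing or empty '{required}'")
    return problems


# Hypothetical skill file, for illustration only.
example = """---
name: plan-before-edit
description: No edits until a written plan exists.
---
1. State the plan before touching any file.
"""
print(check_frontmatter(example))
```

A check like this only covers the header. Whether the rules in the body are concrete enough is still a question for the blind test.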

3. Blind-test it, with and without

Run the same task twice on your daily model, once with the skill loaded, once without. A separate grader judges the pair against a written rubric it alone can see, never knowing which run used the skill. Published research found some skills measure as placebo. This step is how you find out before your users do.
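
The blinding step can be sketched in a few lines, assuming both outputs have already been collected as text. The names `blind_pair` and `unblind` and the A/B labels are hypothetical, chosen for this illustration. The point is that the grader receives only the shuffled packet, and the key stays with the operator until the verdict is in.

```python
import random


def blind_pair(with_skill: str, without_skill: str, rng: random.Random):
    """Shuffle two outputs into labels A and B.

    Returns (packet, key): the packet goes to the grader, the key stays
    with the operator until grading is finished.
    """
    arms = [("skill", with_skill), ("plain", without_skill)]
    rng.shuffle(arms)
    packet = {"A": arms[0][1], "B": arms[1][1]}
    key = {"A": arms[0][0], "B": arms[1][0]}
    return packet, key


def unblind(verdict: str, key: dict) -> str:
    """Map the grader's verdict ('A', 'B', or 'tie') back to an arm."""
    if verdict == "tie":
        return "tie"
    return key[verdict]


packet, key = blind_pair("output with skill", "output without skill", random.Random())
# The grader sees only `packet` plus the rubric and answers 'A', 'B', or 'tie'.
print(unblind("A", key))  # 'skill' or 'plain', depending on the shuffle
```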

4. Keep the losses

Two of the six Rigor Pack skills lost their first blind gradings. The runs were published anyway, the failures were diagnosed from actual outputs, and the rewrites won on held-out tasks. The losses are the reason anyone believes the wins.

The Prompts

Three prompts, ready to paste

These are the working prompts behind the Rigor Pack, first used with Claude Fable 5 and generalized for any stronger model. Use them in order: Prompt 1 finds the gap, Prompt 2 writes the skill, Prompt 3 fixes a skill that lost its blind test.

Give this to the smarter model with honest context about your daily work. It returns ranked skill candidates.

Prompt 1: paste into the smartest model you can reach

You are the strongest model I have access to right now. Interview my
daily setup and find the discipline gaps worth turning into agent skills.

My daily model: [MODEL NAME]
What I use it for most: [2-3 SENTENCES ABOUT YOUR REAL WORK]

A discipline gap is a habit the model skips under pressure, not a task
I want done. The shape I mean: editing files before stating a plan,
trusting documentation over the live system, mixing unrequested fixes
into a diff, narrating process into deliverables, padding prose.

Do this:
1. Ask me up to 5 questions about where my daily model's output
   disappoints me. One question at a time.
2. From my answers, name the 3 discipline gaps with the highest cost,
   ranked.
3. For each gap, state: the habit as one checkable rule, the failure
   it prevents, and one realistic task where a blind test would show
   the difference.

Do not write any skill yet. Naming the right gap is the whole game.

The Test Rig

The blind-test rig

Three artifacts make a test blind: a task with planted traps, a rubric only the grader sees, and a protocol that hides which run used the skill. Templates for each.

Plant traps the skill should catch and note them nowhere in the task. The task must be self-contained: all code and files inline.

task-template.md

# Task

[ONE PARAGRAPH: a realistic request, written exactly as a user would
ask it. Do not hint at the traps.]

## File: [path/name]
[INLINE CODE OR DOCUMENT. Self-contained: everything the model needs
is inside this task, nothing external.]

## File: [second file, if a trap spans sources]

--- KEEP BELOW IN A SEPARATE FILE THE TESTED MODEL NEVER SEES ---

TRAP CHECKLIST:
- Trap 1: [the planted flaw. Proven shapes: an off-by-one guarded by
  a test that cannot fail, a spec that contradicts itself, a README
  that disagrees with the code, tempting unrelated fixes sitting next
  to a one-line ticket]
- Trap 2: [...]

Each trap must be objective: caught or not caught, no judgment call.
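
One way to keep trap scoring objective is to record each trap as a plain boolean per run. This is a minimal sketch under that assumption. The `TrapResult` type, the `trap_summary` helper, and the outcomes shown are hypothetical, not results from the Rigor Pack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapResult:
    trap: str                   # short label from the trap checklist
    caught_with_skill: bool     # objective: the run flagged it or did not
    caught_without_skill: bool


def trap_summary(results: list[TrapResult]) -> dict[str, int]:
    """Count caught traps per arm; no partial credit."""
    return {
        "total": len(results),
        "with_skill": sum(r.caught_with_skill for r in results),
        "without_skill": sum(r.caught_without_skill for r in results),
    }


# Hypothetical outcomes, for illustration only.
results = [
    TrapResult("off-by-one behind a test that cannot fail", True, False),
    TrapResult("README disagrees with the code", True, True),
]
print(trap_summary(results))
```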

Worked Examples

Six skills that went through this exact loop

Written with Claude Fable 5 during its included-access window, blind-tested on Opus 4.8: 12 wins, 0 losses, 2 ties across 14 gradings. Copy them per skill or download the zip, free with no signup.

Open the Rigor Pack

Every task, rubric, and run ships on the same page, including the two first versions that lost.

Install

Where skills live, per tool

SKILL.md is an open standard. The paths below come from each vendor's own documentation.

Tool	Where skills live
Claude Code	`~/.claude/skills/<name>/SKILL.md`
Codex CLI	`~/.agents/skills/<name>/SKILL.md`
Gemini CLI	`~/.gemini/skills/` or the `~/.agents/skills/` alias

One copy in the shared agents directory covers both Codex CLI and Gemini CLI. In Gemini CLI, `/skills list` confirms discovery. Codex scans on its own.
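
Copying a skill into place can be scripted. The sketch below assumes a skill is a folder containing SKILL.md and uses the two roots from the table above. The `install_skill` helper and its `home` parameter are hypothetical, added so the copy can be tried against a scratch directory first.

```python
import shutil
from pathlib import Path

# Skill roots relative to the home directory, taken from the table above.
# The shared agents root is the one that covers both Codex CLI and Gemini CLI.
SKILL_ROOTS = (".claude/skills", ".agents/skills")


def install_skill(skill_dir: Path, home: Path) -> list[Path]:
    """Copy one skill folder into each root; return the installed SKILL.md paths."""
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"{skill_dir} has no SKILL.md")
    installed = []
    for root in SKILL_ROOTS:
        target = home / root / skill_dir.name
        # dirs_exist_ok lets a re-install overwrite an earlier copy.
        shutil.copytree(skill_dir, target, dirs_exist_ok=True)
        installed.append(target / "SKILL.md")
    return installed


# Example call (not run here): install_skill(Path("adversarial-verify"), Path.home())
```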

The Worked Example

A losing skill, fixed in public

The adversarial-verify skill from the Rigor Pack is the honest demonstration of step 4. Here is its full arc.

v1: lost both gradings

Trap catches tied with the plain run, then the skill arm lost on verbosity. It narrated its own verification process into the deliverable. The graders scored the narration as noise, because it is.

v2: won one, regressed on the other

The rewrite made verification internal, and the leaner output won task one. But on task two it silently resolved a planted spec contradiction instead of surfacing it. Brevity had crowded out the exact behavior the skill exists for.

v3: the missing rule, then a held-out win

The diagnosis showed the attack list covered inputs, assumptions, and evidence, but never the requirements themselves. v3 added one rule: attack the requirements before your answer to them. It won the regressed task, then won a held-out task it had never been iterated against.

The lesson is the protocol, not the skill: diagnose from actual outputs, change one thing, and re-test on tasks the revision has never seen.
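
The held-out rule is easy to break by accident, so it helps to track which tasks informed each rewrite. This is a minimal sketch, assuming tasks are identified by name. The `held_out_tasks` helper and the task names are hypothetical.

```python
def held_out_tasks(all_tasks: list[str], iterated_on: set[str]) -> list[str]:
    """Return tasks the current revision was never tuned against, in order."""
    return [task for task in all_tasks if task not in iterated_on]


# Hypothetical task names, for illustration only.
all_tasks = ["task-1", "task-2", "task-3"]
iterated_on = {"task-1", "task-2"}  # tasks whose outputs informed the rewrite
print(held_out_tasks(all_tasks, iterated_on))  # ['task-3']
```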

The bigger half of the same problem

Skills extract a model's procedure into files you own. But your own context, decisions, and working knowledge still evaporate every time a session ends, and no skill fixes that, because skills carry procedure, not memory. That gap is what Iwo's Second Brain exists for: your decisions and context in plain files the model reads, so every model that passes through your terminal is already smart about you. The extraction principle, applied to yourself.

See how Second Brain works
